Abstract

We study the learnability of linear separators in $\Re^d$ in the presence of bounded (a.k.a. Massart) noise.
This is a realistic generalization of the random classification noise model, where the adversary can flip
each example $x$ with probability $\eta(x) \le \eta$. We provide the first polynomial time algorithm that can learn
linear separators to arbitrarily small excess error in this noise model under the uniform distribution over
the unit ball in $\Re^d$, for some constant value of $\eta$. While widely studied in the statistical learning theory
community in the context of getting faster convergence rates, computationally efficient algorithms in this 
model had remained elusive. Our work provides the first evidence that one can indeed design algorithms 
achieving arbitrarily small excess error in polynomial time under this realistic noise model and thus 
opens up a new and exciting line of research. 

We additionally provide lower bounds showing that popular algorithms such as hinge loss minimization
and averaging cannot lead to arbitrarily small excess error under Massart noise, even under the
uniform distribution. Our work instead makes use of a margin-based technique developed in the context
of active learning. As a result, our algorithm is also an active learning algorithm with label complexity
that is only logarithmic in the desired excess error $\epsilon$.


1 Introduction 

Overview Linear separators are the most popular classifiers studied in both the theory and practice of 
machine learning. Designing noise tolerant, polynomial time learning algorithms that achieve arbitrarily 
small excess error rates for linear separators is a long-standing question in learning theory. In the absence 
of noise (when the data is realizable) such algorithms exist via linear programming ifTTI . However, the 
problem becomes significantly harder in the presence of label noise. In particular, in this work we are 
concerned with designing algorithms that can achieve error $\mathrm{OPT} + \epsilon$ which is arbitrarily close to OPT, the
error of the best linear separator, and run in time polynomial in $1/\epsilon$ and $d$ (as usual, we call $\epsilon$ the excess error).
Such strong guarantees are only known for the well studied random classification noise model Q. In this 
work, we provide the first algorithm that can achieve arbitrarily small excess error, in truly polynomial time, 
for bounded noise, also called Massart noise [28], a much more realistic and widely studied noise model in
statistical learning theory @. We additionally show strong lower bounds under the same noise model for two 
other computationally efficient learning algorithms (hinge loss minimization and the averaging algorithm), 
which could be of independent interest. 

Motivation The work on computationally efficient algorithms for learning halfspaces has focused on two 
different extremes. On one hand, for the very stylized random classification noise model (RCN), where each
example $x$ is flipped independently with equal probability $\eta$, several works have provided computationally
efficient algorithms that can achieve arbitrarily small excess error in polynomial time (71 [30[ El — note 
that all these results crucially exploit the high amount of symmetry present in the RCN noise. At the other 
extreme, there has been significant work on much more difficult and adversarial noise models, including 
the agnostic model [25] and malicious noise models [24]. The best results here, however, not only require
additional distributional assumptions about the marginal over the instance space, but they only achieve 
much weaker multiplicative approximation guarantees (23l [271 IZft : for example, the best result of this form 
for the case of the uniform distribution over the unit sphere $S^{d-1}$ achieves excess error $c\,\mathrm{OPT}$ [2], for some
large constant $c$. While interesting from a technical point of view, guarantees of this form are somewhat
troubling from a statistical point of view, as they are inconsistent, in the sense that there is a barrier $O(\mathrm{OPT})$,
after which we cannot prove that the excess error further decreases as we get more and more samples. In
fact, recent evidence shows that this is unavoidable for polynomial time algorithms for such adversarial 
noise models [12].

Our Results In this work we identify a realistic and widely studied noise model in the statistical learning 
theory, the so called Massart noise [9], for which we can prove much stronger guarantees. Massart noise
can be thought of as a generalization of the random classification noise model where the label of each
example $x$ is flipped independently with probability $\eta(x) < 1/2$. The adversary has control over choosing
a different noise rate $\eta(x)$ for every example $x$, with the only constraint that $\eta(x) \le \eta$. From a
statistical point of view, it is well known that under this model, we can get faster rates compared to worst
case joint distributions [9]. In computational learning theory, this noise model was also studied, but under
the name of malicious misclassification noise [29, 31]. However, due to its highly unsymmetric nature, to
date, computationally efficient learning algorithms in this model have remained elusive. In this work, we
provide the first computationally efficient algorithm achieving arbitrarily small excess error for learning 
linear separators. 

Formally, we show that there exists a polynomial time algorithm that can learn linear separators to error
$\mathrm{OPT} + \epsilon$ and run in $\mathrm{poly}(d, \frac{1}{\epsilon})$ when the underlying distribution is the uniform distribution over the unit ball
in $\Re^d$ and the noise of each example is upper bounded by a constant $\eta$ (independent of the dimension).

As mentioned earlier, a result of this form was only known for random classification noise. From a 
technical point of view, as opposed to random classification noise, where the error of each classifier scales 
uniformly under the observed labels, the observed error of classifiers under Massart noise could change
drastically in a non-monotonic fashion. This is due to the fact that the adversary has control over choosing
a different noise rate $\eta(x) \le \eta$ for every example $x$. As a result, as we show in our work (see Section 4),
standard algorithms such as the averaging algorithm [30], which work for random noise, can only achieve
a much poorer excess error (as a function of $\eta$) under Massart noise. Technically speaking, this is due to
the fact that Massart noise can introduce high correlations between the observed labels and the component 
orthogonal to the direction of the best classifier. 

In the face of these challenges, we take an entirely different approach than previously considered for random
classification noise. Specifically, we analyze a recent margin-based algorithm of [2]. This algorithm was
designed for learning linear separators under agnostic and malicious noise models, and it was shown to
achieve an excess error of $c\,\mathrm{OPT}$ for a constant $c$. By using new structural insights, we show that there
exists a constant $\eta$ (independent of the dimension), so that if we use Massart noise where the flipping
probability is upper bounded by $\eta$, we can use a modification of the algorithm in [2] and achieve arbitrarily
small excess error. One way to think about this result is that we define an adaptively chosen sequence of
hinge loss minimization problems in smaller and smaller bands around the current guess for the target.
We show by relating the hinge loss and 0/1-loss together with a careful localization analysis that these will
direct us closer and closer to the optimal classifier, allowing us to achieve arbitrarily small excess error rates
in polynomial time.

Given that our algorithm is an adaptively chosen sequence of hinge loss minimization problems, one 
might wonder what guarantee one-shot hinge loss minimization could provide. In Section 5 we show a
strong negative result: for every $\tau$ and $\eta < 1/2$, there is a noisy distribution $\tilde{D}$ over $\Re^d \times \{0, 1\}$ satisfying
Massart noise with parameter $\eta$ and an $\epsilon > 0$, such that $\tau$-hinge loss minimization returns a classifier with
excess error $\Omega(\epsilon)$. This result could be of independent interest. While there exists earlier work showing that
hinge loss minimization can lead to classifiers of large 0/1-loss @, the lower bounds in that paper employ 
distributions with significant mass on discrete points with flipped label (which is not possible under Massart
noise) at a very large distance from the optimal classifier. Thus, that result makes strong use of the hinge 
loss’s sensitivity to errors at large distance. Here, we show that hinge loss minimization is bound to fail 
under much more benign conditions. 

One appealing feature of our result is that the algorithm we analyze is in fact naturally adaptable to the active
learning or selective sampling scenario (intensively studied in recent years urn Emm where the learning 
algorithms only receive the classifications of examples when they ask for them. We show that, in this model, 
our algorithms achieve a label complexity whose dependence on the error parameter $\epsilon$ is polylogarithmic
(and thus exponentially better than that of any passive algorithm). This provides the first polynomial-time 
active learning algorithm for learning linear separators under Massart noise. We note that prior to our work 
only inefficient algorithms could achieve the desired label complexity under Massart noise Il4ll20ll. 

Related Work The agnostic noise model is notoriously hard to deal with computationally and there is 
significant evidence that achieving arbitrarily small excess error in polynomial time is hard in this model |E1 
EH El- For this model, under our distributional assumptions, [23] provides an algorithm that learns linear
separators in $\Re^d$ to excess error at most $\epsilon$, but whose running time is $\mathrm{poly}(d^{\exp(1/\epsilon)})$. Recent work shows
evidence that the exponential dependence on $1/\epsilon$ is unavoidable for the agnostic case [26]. We
side-step this by considering a more structured, yet realistic noise model. 

Motivated by the fact that many modern machine learning applications have massive amounts of unannotated
or unlabeled data, there has been significant interest in designing active learning algorithms that most
efficiently utilize the available data, while minimizing the need for human intervention. Over the past decade 
there has been substantial progress on understanding the underlying statistical principles of active learning, 
and several general characterizations have been developed for describing when active learning could have an 
advantage over the classical passive supervised learning paradigm both in the noise free settings and in the 
agnostic case [EH EH 12 01 EH EH EH El M • However, despite many efforts, except for very simple noise 
models (random classification noise Q and linear noise l lThl ). to date there are no known computationally 
efficient algorithms with provable guarantees in the presence of Massart noise that can achieve arbitrarily 
small excess error. 

We note that work of 12T1 provides computationally efficient algorithms for both passive and active 
learning under the assumption that the hinge loss (or other surrogate loss) minimizer aligns with the minimizer
of the 0/1-loss. In our work (Section 5), we show that this is not the case under Massart noise even
when the marginal over the instance space is uniform, but still provide a computationally efficient algorithm 
for this much more challenging setting. 

2 Preliminaries 

We consider the binary classification problem; that is, we work on the problem of predicting a binary label
$y$ for a given instance $x$. We assume that the data points $(x, y)$ are drawn from an unknown underlying
distribution $\tilde{D}$ over $X \times Y$, where $X = \Re^d$ is the instance space and $Y = \{-1, 1\}$ is the label space.
For the purpose of this work, we consider distributions where the marginal of $\tilde{D}$ over $X$ is a uniform
distribution on a $d$-dimensional unit ball. We work with the class of all homogeneous halfspaces, denoted
by $H = \{\mathrm{sign}(w \cdot x) : w \in \Re^d\}$. For a given halfspace $w \in H$, we define the error of $w$ with respect to $\tilde{D}$
by $\mathrm{err}_{\tilde{D}}(w) = \Pr_{(x,y) \sim \tilde{D}}[\mathrm{sign}(w \cdot x) \ne y]$.

We examine learning halfspaces in the presence of Massart noise. In this setting, we assume that the
Bayes optimal classifier is a linear separator $w^*$. Note that $w^*$ can have a non-zero error. Then Massart
noise with parameter $\beta > 0$ is a condition such that for all $x$, the conditional label probability is such that

$$|\Pr(y = 1 \mid x) - \Pr(y = -1 \mid x)| \ge \beta. \qquad (1)$$

Equivalently, we say that $\tilde{D}$ satisfies Massart noise with parameter $\beta$ if an adversary constructs $\tilde{D}$ by first
taking the distribution $D$ over instances $(x, \mathrm{sign}(w^* \cdot x))$ and then flipping the label of an instance $x$ with
probability at most $\frac{1-\beta}{2}$.¹ Also note that under distribution $\tilde{D}$, $w^*$ remains the Bayes optimal classifier. In
the remainder of this work, we refer to $\tilde{D}$ as the “noisy” distribution and to distribution $D$ over instances
$(x, \mathrm{sign}(w^* \cdot x))$ as the “clean” distribution.
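The construction above can be illustrated with a small simulation. This is a minimal sketch, not part of the paper: the function names `sample_unit_ball` and `massart_labels` and the particular adversary are hypothetical choices made for illustration. It draws points uniformly from the unit ball, labels them with $w^*$, and flips each label with a point-dependent probability that is clipped to the Massart bound $\frac{1-\beta}{2}$.

```python
import numpy as np

def sample_unit_ball(n, d, rng):
    """Draw n points uniformly from the d-dimensional unit ball."""
    g = rng.standard_normal((n, d))
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)
    radii = rng.random(n) ** (1.0 / d)  # the radius has CDF r^d on [0, 1]
    return directions * radii[:, None]

def massart_labels(X, w_star, beta, flip_rate, rng):
    """Label X by sign(w_star . x), then flip each label with probability
    flip_rate(x), clipped to the Massart bound (1 - beta) / 2."""
    clean = np.where(X @ w_star >= 0, 1, -1)
    eta = np.clip(flip_rate(X), 0.0, (1.0 - beta) / 2.0)
    flips = rng.random(len(X)) < eta
    return np.where(flips, -clean, clean), clean

rng = np.random.default_rng(0)
d, beta = 5, 0.8
w_star = np.eye(d)[0]
X = sample_unit_ball(20000, d, rng)
# Hypothetical adversary: spend the full noise budget where x_1 * x_2 < 0,
# and add no noise elsewhere.
adversary = lambda X: np.where(X[:, 0] * X[:, 1] < 0, 1.0, 0.0)
y, y_clean = massart_labels(X, w_star, beta, adversary, rng)
noise_rate = float(np.mean(y != y_clean))
```

Because the flip probability never exceeds $\frac{1-\beta}{2} < \frac{1}{2}$ at any point, $w^*$ stays the Bayes optimal classifier for the simulated noisy distribution.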

Our goal is then to find a halfspace $w$ that has small excess error, as compared to the Bayes optimal
classifier $w^*$. That is, for any $\epsilon > 0$, find a halfspace $w$ such that $\mathrm{err}_{\tilde{D}}(w) - \mathrm{err}_{\tilde{D}}(w^*) \le \epsilon$. Note that
the excess error of any classifier $w$ only depends on the points in the region where $w$ and $w^*$ disagree. So,
$\mathrm{err}_{\tilde{D}}(w) - \mathrm{err}_{\tilde{D}}(w^*) \le \frac{\theta(w, w^*)}{\pi}$. Additionally, under Massart noise the amount of noise in the disagreement
region is also bounded by $\frac{1-\beta}{2}$. It is not difficult to see that under Massart noise,

$$\beta\,\frac{\theta(w, w^*)}{\pi} \le \mathrm{err}_{\tilde{D}}(w) - \mathrm{err}_{\tilde{D}}(w^*). \qquad (2)$$

In our analysis, we frequently examine the region within a certain margin of a halfspace. For a halfspace
$w$ and margin $b$, let $S_{w,b}$ be the set of all points that fall within a margin $b$ from $w$, i.e., $S_{w,b} = \{x : |w \cdot x| \le b\}$.
For distributions $\tilde{D}$ and $D$, we indicate the distribution conditioned on $S_{w,b}$ by $\tilde{D}_{w,b}$ and $D_{w,b}$,
respectively. In the remainder of this work, we refer to the region $S_{w,b}$ as “the band”.

In our analysis, we use hinge loss as a convex surrogate function for the 0/1-loss. For a halfspace $w$, we
use the $\tau$-normalized hinge loss, defined as $\ell_\tau(w, x, y) = \max\{0, 1 - \frac{y(w \cdot x)}{\tau}\}$. For a labeled sample set
$W$, let $\ell_\tau(w, W) = \frac{1}{|W|} \sum_{(x,y) \in W} \ell_\tau(w, x, y)$ be the empirical hinge loss of a vector $w$ with respect to $W$.
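The band and the normalized hinge loss translate directly into code. The following is a minimal sketch; the function names are hypothetical and the three sample points are made up for illustration.

```python
import numpy as np

def hinge_loss(w, X, y, tau):
    """Per-example tau-normalized hinge loss max{0, 1 - y (w . x) / tau}."""
    return np.maximum(0.0, 1.0 - y * (X @ w) / tau)

def empirical_hinge_loss(w, X, y, tau):
    """Average of the tau-normalized hinge loss over a labeled sample."""
    return float(np.mean(hinge_loss(w, X, y, tau)))

def in_band(w, X, b):
    """Boolean mask of the points that lie in the band {x : |w . x| <= b}."""
    return np.abs(X @ w) <= b

w = np.array([1.0, 0.0])
X = np.array([[0.5, 0.0], [-0.05, 0.3], [0.02, -0.9]])
y = np.array([1, -1, -1])
losses = hinge_loss(w, X, y, tau=0.1)   # margins 0.5, 0.05, -0.02
mask = in_band(w, X, b=0.1)
```

With $\tau = 0.1$ the first point is classified with margin well above $\tau$ and incurs no loss, the second is correct but inside the margin, and the third is misclassified, so its loss exceeds 1. Only the last two points fall in the band of width 0.1.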

3 Computationally Efficient Algorithm for Massart Noise 

In this section, we prove our main result for learning half-spaces in the presence of Massart noise. We focus on the
case where $D$ is the uniform distribution on the $d$-dimensional unit ball. Our main theorem is as follows.

Theorem 1. Let the optimal Bayes classifier be a half-space denoted by $w^*$. Assume that the Massart
noise condition holds for some $\beta > 1 - 3.6 \times 10^{-6}$. Then for any $\epsilon, \delta > 0$, Algorithm 1 with $\lambda = 10^{-8}$,
$\alpha_k = 0.038709\pi(1 - \lambda)^{k-1}$, $b_{k-1} = \frac{2.3463\,\alpha_k}{\sqrt{d}}$, and $\tau_k = \sqrt{2.50306}\,(3.6 \times 10^{-6})^{1/4}\, b_{k-1}$, runs in polynomial
time, proceeds in $s = O(\log\frac{1}{\epsilon})$ rounds, where in round $k$ it takes $n_k = \mathrm{poly}(d, \exp(k), \log(\frac{1}{\delta}))$ unlabeled
samples and $m_k = O(d(d + \log(k/\delta)))$ labels and with probability $(1 - \delta)$ returns a linear separator that
has excess error (compared to $w^*$) of at most $\epsilon$.

¹Note that the relationship between the Massart noise parameter $\beta$ and the maximum flipping probability $\eta$ discussed in the
introduction is $\eta = \frac{1-\beta}{2}$.

Note that in the above theorem and Algorithm 1 the value of $\beta$ is unknown to the algorithm, and
therefore, our results are adaptive to values of $\beta$ within the acceptable range defined by the theorem.
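To make the parameter schedule of Theorem 1 concrete, the sketch below evaluates $\alpha_k$, $b_{k-1}$ and $\tau_k$ for a given round and dimension. The function names are hypothetical, and the round count assumes the form $s = \log_{1/(1-\lambda)}(1/\epsilon)$, which is an assumption about the base of the logarithm rather than something the theorem statement fixes.

```python
import math

LAMBDA = 1e-8          # learning rate lambda from Theorem 1
C = 2.3463             # band-width constant, b_{k-1} = C * alpha_k / sqrt(d)
BETA_MIN = 1 - 3.6e-6  # Massart parameter must exceed this value

def schedule(k, d):
    """Return (alpha_k, b_{k-1}, tau_k) for round k in dimension d."""
    alpha_k = 0.038709 * math.pi * (1 - LAMBDA) ** (k - 1)
    b_prev = C * alpha_k / math.sqrt(d)
    tau_k = math.sqrt(2.50306) * (3.6e-6) ** 0.25 * b_prev
    return alpha_k, b_prev, tau_k

def num_rounds(eps):
    """Assumed round count: the smallest s with (1 - lambda)^s <= eps."""
    return math.ceil(math.log(1 / eps) / -math.log1p(-LAMBDA))

alpha_1, b_0, tau_1 = schedule(1, d=20)
max_flip_probability = (1 - BETA_MIN) / 2   # eta = (1 - beta) / 2
```

Under these constants the ratio $\tau_k / b_{k-1}$ is the same in every round, and the tolerated flipping probability $\eta = \frac{1-\beta}{2}$ is below $1.8 \times 10^{-6}$.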

The algorithm (Algorithm 1) is similar to that of [2] and uses an iterative margin-based approach. The
algorithm runs for $s = \log_{\frac{1}{1-\lambda}}(\frac{1}{\epsilon})$ rounds for a constant $\lambda \in (0, 1]$. By induction assume that our algorithm
produces a hypothesis $w_{k-1}$ at round $k - 1$ such that $\theta(w_{k-1}, w^*) \le \alpha_k$. We satisfy the base case by
using an algorithm of [27]. At round $k$, we sample $m_k$ labeled examples from the conditional distribution
$\tilde{D}_{w_{k-1}, b_{k-1}}$, whose marginal is the uniform distribution over $\{x : |w_{k-1} \cdot x| \le b_{k-1}\}$. We then choose $w_k$ from
the set of all hypotheses $B(w_{k-1}, \alpha_k) = \{w : \theta(w, w_{k-1}) \le \alpha_k\}$ such that $w_k$ minimizes the empirical
hinge loss over these examples. Subsequently, as we prove in detail later, $\theta(w_k, w^*) \le \alpha_{k+1}$. Note that
for any $w$, the excess error of $w$ is at most the error of $w$ on $D$ when the labels are corrected according to
$w^*$, i.e., $\mathrm{err}_{\tilde{D}}(w) - \mathrm{err}_{\tilde{D}}(w^*) \le \mathrm{err}_D(w)$. Moreover, when $D$ is uniform, $\mathrm{err}_D(w) = \frac{\theta(w, w^*)}{\pi}$. Hence,
$\theta(w_s, w^*) \le \pi\epsilon$ implies that $w_s$ has excess error of at most $\epsilon$.

The algorithm described below was originally introduced to achieve an error of $c \cdot \mathrm{err}(w^*)$ for some constant
$c$ in the presence of adversarial noise. Achieving a small excess error $\mathrm{err}(w^*) + \epsilon$ is a much more ambitious
goal, one that requires new technical insights. Our two crucial technical innovations are as follows. We first
make a key observation that under Massart noise, the noise rate over any conditional distribution of $\tilde{D}$ is still
at most $\frac{1-\beta}{2}$. Therefore, as we focus on the distribution within the band, our noise rate does not increase.
Our second technical contribution is a careful choice of parameters. Indeed the choice of parameters, up to
a constant, plays an important role in tolerating a constant amount of Massart noise. Using these insights,
we show that the algorithm by [2] can indeed achieve a much stronger guarantee, namely arbitrarily small
excess error in the presence of Massart noise. That is, for any $\epsilon$, this algorithm can achieve error of $\mathrm{err}(w^*) + \epsilon$
in the presence of Massart noise.

Algorithm 1 Efficient Algorithm for Arbitrarily Small Excess Error for Massart Noise

Input: A distribution $\tilde{D}$. An oracle that returns $x$ and an oracle that returns $y$ for a $(x, y)$ sampled from $\tilde{D}$.
Permitted excess error $\epsilon$ and probability of failure $\delta$.

Parameters: A learning rate $\lambda$; a sequence of sample sizes $m_k$; a sequence of angles of the hypothesis
space $\alpha_k$; a sequence of widths of the labeled space $b_k$; a sequence of thresholds of hinge-loss $\tau_k$.

Algorithm:

1. Take $\mathrm{poly}(d, \frac{1}{\delta})$ samples and run the $\mathrm{poly}(d, \frac{1}{\delta})$-time algorithm by [27] to find a half-space $w_0$ with excess
error 0.0387089 such that $\theta(w^*, w_0) \le 0.038709\pi$ (refer to Appendix C).

2. Draw $m_1$ examples $(x, y)$ from $\tilde{D}$ and put them into a working set $W$.

3. For $k = 1, \ldots, \log_{\frac{1}{1-\lambda}}(\frac{1}{\epsilon}) = s$.

(a) Find $v_k$ such that $\|v_k - w_{k-1}\| < \alpha_k$ (as a result $v_k \in B(w_{k-1}, \alpha_k)$), that minimizes the empirical
hinge loss over $W$ using threshold $\tau_k$. That is, $\ell_{\tau_k}(v_k, W) \le \min_{w \in B(w_{k-1}, \alpha_k)} \ell_{\tau_k}(w, W) + 10^{-8}$.

(b) Clear the working set $W$.

(c) Normalize $v_k$ to $w_k = \frac{v_k}{\|v_k\|_2}$. Until $m_{k+1}$ additional examples are put in $W$, draw an example $x$
from $\tilde{D}$. If $|w_k \cdot x| \ge b_k$, then reject $x$, else put $(x, y)$ into $W$.

Output: Return $w_s$, which has excess error $\epsilon$ with probability $1 - \delta$.
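The loop structure of Algorithm 1 can be sketched in a few dozen lines. This is an illustrative sketch under several assumptions, not the algorithm as analyzed: the hinge minimization in step 3(a) is done with projected subgradient descent (a swapped-in solver), step 1 is replaced by a hand-picked warm start, the noise flips every label with a constant probability `eta` (a simple special case of Massart noise), and the parameters `alpha0`, `shrink` and `tau_ratio` are hypothetical values chosen so the loop finishes quickly instead of the constants of Theorem 1.

```python
import numpy as np

def sample_ball(n, d, rng):
    """Uniform samples from the d-dimensional unit ball."""
    g = rng.standard_normal((n, d))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    return g * rng.random((n, 1)) ** (1.0 / d)

def noisy_label(X, w_star, eta, rng):
    """Label oracle: sign(w_star . x), flipped with probability eta."""
    y = np.where(X @ w_star >= 0, 1, -1)
    return np.where(rng.random(len(X)) < eta, -y, y)

def min_hinge_in_ball(X, y, center, radius, tau, steps=200):
    """Projected subgradient descent on the tau-normalized hinge loss,
    constrained to ||v - center|| <= radius. Returns the best iterate."""
    v, best_v, best_loss = center.copy(), center.copy(), np.inf
    for t in range(1, steps + 1):
        margins = y * (X @ v) / tau
        loss = np.mean(np.maximum(0.0, 1.0 - margins))
        if loss < best_loss:
            best_v, best_loss = v.copy(), loss
        active = margins < 1.0
        grad = -(y[active, None] * X[active]).sum(axis=0) / (tau * len(X))
        v = v - (radius / np.sqrt(t)) * grad / (np.linalg.norm(grad) + 1e-12)
        diff = v - center
        norm = np.linalg.norm(diff)
        if norm > radius:                      # project back onto the ball
            v = center + diff * (radius / norm)
    return best_v

def margin_based_learner(w0, w_star, d, rounds, m, eta, rng,
                         alpha0=0.5, shrink=0.7, c=2.3463, tau_ratio=0.5):
    w = w0 / np.linalg.norm(w0)
    for k in range(1, rounds + 1):
        alpha = alpha0 * shrink ** (k - 1)     # hypothetical angle schedule
        b = c * alpha / np.sqrt(d)             # band width for this round
        W = np.empty((0, d))
        while len(W) < m:                      # rejection-sample the band
            X = sample_ball(4 * m, d, rng)
            W = np.vstack([W, X[np.abs(X @ w) <= b]])
        W = W[:m]
        y = noisy_label(W, w_star, eta, rng)   # labels requested only in the band
        v = min_hinge_in_ball(W, y, w, alpha, tau_ratio * b)
        w = v / np.linalg.norm(v)
    return w

rng = np.random.default_rng(0)
d = 5
w_star = np.eye(d)[0]
w0 = w_star + 0.2 * np.eye(d)[1]               # hypothetical warm start for step 1
w_hat = margin_based_learner(w0, w_star, d, rounds=8, m=2000, eta=0.05, rng=rng)
angle = float(np.arccos(np.clip(w_hat @ w_star, -1.0, 1.0)))
```

The sketch requests labels only for points that land in the current band, which is what makes the procedure usable as an active learner. It carries none of the guarantees of Theorem 1, since those depend on the specific constants and on the initialization of [27].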


Overview of our analysis: Similar to [2], we divide $\mathrm{err}_D(w_k)$ into two categories: error in the band, i.e., on
$x \in S_{w_{k-1}, b_{k-1}}$, and error outside the band, on $x \notin S_{w_{k-1}, b_{k-1}}$. We choose $b_{k-1}$ and $\alpha_k$ such that, for every
hypothesis $w \in B(w_{k-1}, \alpha_k)$ that is considered at step $k$, the probability mass outside the band such that $w$
and $w^*$ also disagree is very small (Lemma 5). Therefore, the error associated with the region outside the
band is also very small. This motivates the design of the algorithm to only minimize the error in the band.
Furthermore, the probability mass of the band is also small enough such that for $\mathrm{err}_D(w_k) \le \frac{\alpha_{k+1}}{\pi}$ to hold,
it suffices for $w_k$ to have a small constant error over the clean distribution restricted to the band, namely
$D_{w_{k-1}, b_{k-1}}$.

This is where minimizing hinge loss in the band comes in. As minimizing the 0/1-loss is NP-hard,
an alternative method for finding $w_k$ with small error in the band is needed. Hinge loss, which is a convex
loss function, can be efficiently minimized. So, we can efficiently find $w_k$ that minimizes the empirical
hinge loss of the sample drawn from $\tilde{D}_{w_{k-1}, b_{k-1}}$. To allow the hinge loss to remain a faithful proxy of
0/1-loss as we focus on bands with smaller widths, we use a normalized hinge loss function defined by
$\ell_\tau(w, x, y) = \max\{0, 1 - \frac{y(w \cdot x)}{\tau}\}$.

A crucial part of our analysis involves showing that if $w_k$ minimizes the empirical hinge loss of the
sample set drawn from $\tilde{D}_{w_{k-1}, b_{k-1}}$, it indeed has a small 0/1-error on $D_{w_{k-1}, b_{k-1}}$. To this end, we first
show that when $\tau_k$ is proportional to $b_k$, the hinge loss of $w^*$ on $D_{w_{k-1}, b_{k-1}}$, which is an upper bound on
the 0/1-error of $w_k$ in the band, is itself small (Lemma 1). Next, we notice that under Massart noise, the
noise rate in any marginal of the distribution is still at most $\frac{1-\beta}{2}$. Therefore, focusing the distribution in
the band does not increase the probability of noise in the band. Moreover, the noise points in the band are
close to the decision boundary so, intuitively speaking, they cannot increase the hinge loss too much. Using
these insights we can show that the hinge loss of $w_k$ on $\tilde{D}_{w_{k-1}, b_{k-1}}$ is close to its hinge loss on $D_{w_{k-1}, b_{k-1}}$
(Lemma 2).

Proof of Theorem 1 and related lemmas

To prove Theorem 1, we first introduce a series of lemmas concerning the behavior of hinge loss in the band.
These lemmas build up towards showing that $w_k$ has error of at most a fixed small constant in the band.

For ease of exposition, for any $k$, let $\tilde{D}_k$ and $D_k$ represent $\tilde{D}_{w_{k-1}, b_{k-1}}$ and $D_{w_{k-1}, b_{k-1}}$, respectively, and
$\ell(\cdot)$ represent $\ell_{\tau_k}(\cdot)$. Furthermore, let $c = 2.3463$, such that $b_{k-1} = \frac{c\,\alpha_k}{\sqrt{d}}$.

Our first lemma, whose proof appears in Appendix B, provides an upper bound on the true hinge error
of $w^*$ on the clean distribution in the band.

Lemma 1. $E_{(x,y) \sim D_k}\, \ell(w^*, x, y) \le 0.665769\, \frac{\tau_k}{b_{k-1}}$.

The next lemma compares the true hinge loss of any $w \in B(w_{k-1}, \alpha_k)$ on two distributions, $\tilde{D}_k$ and
$D_k$. It is clear that the difference between the hinge loss on these two distributions is entirely attributed to
the noise points and their margin from $w$. A key insight in the proof of this lemma is that as we concentrate
in the band, the probability of seeing a noise point remains under $\frac{1-\beta}{2}$. This is due to the fact that under
Massart noise, each label can be changed with probability at most $\frac{1-\beta}{2}$. Furthermore, by concentrating in
the band all points are close to the decision boundary of $w_{k-1}$. Since $w$ is also close in angle to $w_{k-1}$,
points in the band are also close to the decision boundary of $w$. Therefore the hinge loss of noise points in
the band cannot increase the total hinge loss of $w$ by too much.

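Lemma 2. For any $w$ such that $w \in B(w_{k-1}, \alpha_k)$, we have

$$\left| E_{(x,y) \sim \tilde{D}_k}\, \ell(w, x, y) - E_{(x,y) \sim D_k}\, \ell(w, x, y) \right| \le 1.092\sqrt{2}\,\sqrt{1-\beta}\;\frac{b_{k-1}}{\tau_k}.$$

Proof. Let $N$ be the set of noise points. We have,

$$\begin{aligned}
\left| E_{(x,y) \sim \tilde{D}_k}\, \ell(w, x, y) - E_{(x,y) \sim D_k}\, \ell(w, x, y) \right|
&= \left| E_{(x,y) \sim \tilde{D}_k}\big(\ell(w, x, y) - \ell(w, x, \mathrm{sign}(w^* \cdot x))\big) \right| \\
&\le \left| E_{(x,y) \sim \tilde{D}_k}\big(1_{x \in N}\,(\ell(w, x, y) - \ell(w, x, -y))\big) \right| \\
&\le 2\, E_{(x,y) \sim \tilde{D}_k}\Big(1_{x \in N}\,\frac{|w \cdot x|}{\tau_k}\Big) \\
&\le \frac{2}{\tau_k} \sqrt{\Pr_{(x,y) \sim \tilde{D}_k}(x \in N)} \times \sqrt{E_{(x,y) \sim \tilde{D}_k}(w \cdot x)^2} && \text{(By Cauchy-Schwarz)} \\
&\le \frac{2}{\tau_k} \sqrt{\frac{1-\beta}{2}}\, \sqrt{\frac{\alpha_k^2}{d-1} + b_{k-1}^2} && \text{(By Definition 4.1 of [2] for uniform)} \\
&\le \sqrt{2}\,\sqrt{1-\beta}\;\frac{b_{k-1}}{\tau_k} \sqrt{\frac{d}{(d-1)c^2} + 1} \\
&\le 1.092\sqrt{2}\,\sqrt{1-\beta}\;\frac{b_{k-1}}{\tau_k} && \text{(for } d > 20,\ c > 1)
\end{aligned}$$

□
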
For a labeled sample set $W$ drawn at random from $\tilde{D}_k$, let cleaned($W$) be the set of samples with the
labels corrected by $w^*$, i.e., cleaned($W$) $= \{(x, \mathrm{sign}(w^* \cdot x)) : \text{for all } (x, y) \in W\}$. Then by standard
VC-dimension bounds (proof included in Appendix B) there is $m_k \in O(d(d + \log(k/\delta)))$ such that for
any randomly drawn set $W$ of $m_k$ labeled samples from $\tilde{D}_k$, with probability $1 - \frac{\delta}{2(k+k^2)}$, for any
$w \in B(w_{k-1}, \alpha_k)$,

$$|E_{(x,y) \sim \tilde{D}_k}\, \ell(w, x, y) - \ell(w, W)| \le 10^{-8}, \qquad (3)$$

$$|E_{(x,y) \sim D_k}\, \ell(w, x, y) - \ell(w, \mathrm{cleaned}(W))| \le 10^{-8}. \qquad (4)$$

Our next lemma is a crucial step in our analysis of Algorithm 1. This lemma proves that if $w_k \in B(w_{k-1}, \alpha_k)$
minimizes the empirical hinge loss on the sample drawn from the noisy distribution in the band, namely
$\tilde{D}_{w_{k-1}, b_{k-1}}$, then with high probability $w_k$ also has a small 0/1-error with respect to the clean distribution in
the band, i.e., $D_{w_{k-1}, b_{k-1}}$.

Lemma 3. There exists $m_k \in O(d(d + \log(k/\delta)))$, such that for a randomly drawn labeled sample set $W$
of size $m_k$ from $\tilde{D}_k$, and for $w_k$ such that $w_k$ has the minimum empirical hinge loss on $W$ among the set
of all hypotheses in $B(w_{k-1}, \alpha_k)$, with probability $1 - \frac{\delta}{2(k+k^2)}$,

$$\mathrm{err}_{D_k}(w_k) \le 0.757941\,\frac{\tau_k}{b_{k-1}} + 3.303\sqrt{1-\beta}\;\frac{b_{k-1}}{\tau_k} + 3.28 \times 10^{-8}.$$

Proof Sketch. First, we note that the true 0/1-error of $w_k$ on any distribution is at most its true hinge loss on
that distribution. Lemma 1 provides an upper bound on the true hinge loss on distribution $D_k$. Therefore, it
remains to create a connection between the empirical hinge loss of $w_k$ on the sample drawn from $\tilde{D}_k$ and its
true hinge loss on distribution $D_k$. This we achieve by using the generalization bounds of Equations 3 and
4 to connect the empirical and true hinge loss of $w_k$ and $w^*$, and using Lemma 2 to connect the hinge loss of $w_k$
and $w^*$ in the clean and noisy distributions. □

Proof of Theorem 1. For ease of exposition, let $c = 2.3463$. Recall that $\lambda = 10^{-8}$, $\alpha_k = 0.038709\pi(1 - \lambda)^{k-1}$,
$b_{k-1} = \frac{c\,\alpha_k}{\sqrt{d}}$, $\tau_k = \sqrt{2.50306}\,(3.6 \times 10^{-6})^{1/4}\, b_{k-1}$, and $\beta > 1 - 3.6 \times 10^{-6}$.

Note that for any $w$, the excess error of $w$ is at most the error of $w$ on the clean distribution $D$, i.e.,
$\mathrm{err}_{\tilde{D}}(w) - \mathrm{err}_{\tilde{D}}(w^*) \le \mathrm{err}_D(w)$. Moreover, for uniform distribution $D$, $\mathrm{err}_D(w) = \frac{\theta(w^*, w)}{\pi}$. Hence, to
show that $w$ has $\epsilon$ excess error, it suffices to show that $\mathrm{err}_D(w) \le \epsilon$.

Our goal is to achieve excess error of $0.038709(1 - \lambda)^k$ at round $k$. This we do indirectly by bounding
$\mathrm{err}_D(w_k)$ at every step. We use induction. For $k = 0$, we use the algorithm for the adversarial noise model by
[27], which can achieve excess error of $\epsilon$ if $\mathrm{err}_{\tilde{D}}(w^*) < \frac{\epsilon}{256 \log(1/\epsilon)}$ (refer to Appendix C for more details).
For Massart noise, $\mathrm{err}_{\tilde{D}}(w^*) \le \frac{1-\beta}{2}$. So, for our choice of $\beta$, this algorithm can achieve excess error of
0.0387089 in $\mathrm{poly}(d, \frac{1}{\delta})$ samples and run-time. Furthermore, using Equation 2, $\theta(w_0, w^*) \le 0.038709\pi$.

Assume that at round $k - 1$, $\mathrm{err}_D(w_{k-1}) \le 0.038709(1 - \lambda)^{k-1}$. We will show that $w_k$, which is chosen
by the algorithm at round $k$, also has $\mathrm{err}_D(w_k) \le 0.038709(1 - \lambda)^k$.

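First note that $\mathrm{err}_D(w_{k-1}) \le 0.038709(1 - \lambda)^{k-1}$ implies $\theta(w_{k-1}, w^*) \le \alpha_k$. Let $S = S_{w_{k-1}, b_{k-1}}$
indicate the band at round $k$. We divide the error of $w_k$ into two parts, error outside the band and error inside
the band. That is,

$$\mathrm{err}_D(w_k) = \Pr_{x \sim D}[x \notin S \text{ and } (w_k \cdot x)(w^* \cdot x) < 0] + \Pr_{x \sim D}[x \in S \text{ and } (w_k \cdot x)(w^* \cdot x) < 0].$$

For the first part, i.e., error outside of the band, $\Pr_{x \sim D}[x \notin S \text{ and } (w_k \cdot x)(w^* \cdot x) < 0]$ is at most

$$\Pr_{x \sim D}[x \notin S \text{ and } (w_k \cdot x)(w_{k-1} \cdot x) < 0] + \Pr_{x \sim D}[x \notin S \text{ and } (w_{k-1} \cdot x)(w^* \cdot x) < 0] \le \frac{2\alpha_k}{\pi}\, e^{-\frac{c^2(d-2)}{2d}},$$

where this inequality holds by the application of Lemma 5 and the fact that $\theta(w_{k-1}, w_k) \le \alpha_k$ and
$\theta(w_{k-1}, w^*) \le \alpha_k$.

For the second part, i.e., error inside the band,

$$\begin{aligned}
\Pr_{x \sim D}[x \in S \text{ and } (w_k \cdot x)(w^* \cdot x) < 0] &= \mathrm{err}_{D_k}(w_k)\, \Pr_{x \sim D}[x \in S] \\
&\le \mathrm{err}_{D_k}(w_k)\; 2 b_{k-1} \sqrt{\frac{d+1}{2\pi}} && \text{(By Lemma 4)} \\
&\le \mathrm{err}_{D_k}(w_k)\; c\,\alpha_k \sqrt{\frac{2(d+1)}{\pi d}},
\end{aligned}$$

where the last transition holds by the fact that $b_{k-1} = \frac{c\,\alpha_k}{\sqrt{d}}$. Replacing an upper bound on $\mathrm{err}_{D_k}(w_k)$
from Lemma 3, to show that $\mathrm{err}_D(w_k) \le \frac{\alpha_{k+1}}{\pi}$, it suffices to show that the following inequality holds.

$$\left(0.757941\,\frac{\tau_k}{b_{k-1}} + 3.303\sqrt{1-\beta}\;\frac{b_{k-1}}{\tau_k} + 3.28 \times 10^{-8}\right) c\,\alpha_k \sqrt{\frac{2(d+1)}{\pi d}} + \frac{2\alpha_k}{\pi}\, e^{-\frac{c^2(d-2)}{2d}} \le \frac{\alpha_{k+1}}{\pi}.$$

We simplify this inequality as follows.

$$\left(0.757941\,\frac{\tau_k}{b_{k-1}} + 3.303\sqrt{1-\beta}\;\frac{b_{k-1}}{\tau_k} + 3.28 \times 10^{-8}\right) c \sqrt{\frac{2\pi(d+1)}{d}} + 2\, e^{-\frac{c^2(d-2)}{2d}} \le 1 - \lambda.$$
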
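Replacing the values of $c = 2.3463$ and $\tau_k = \sqrt{2.50306}\,(3.6 \times 10^{-6})^{1/4}\, b_{k-1}$, we have

$$\begin{aligned}
& c \sqrt{\frac{2\pi(d+1)}{d}} \left(\sqrt{2.50306}\,(3.6 \times 10^{-6})^{1/4} + \sqrt{2.50306}\,\frac{\sqrt{1-\beta}}{(3.6 \times 10^{-6})^{1/4}} + 3.28 \times 10^{-8}\right) + 2\, e^{-\frac{c^2(d-2)}{2d}} \\
&\quad \le 5.88133 \left(2\sqrt{2.50306}\,(3.6 \times 10^{-6})^{1/4} + 3.28 \times 10^{-8}\right) \sqrt{1 + \frac{1}{d}} + 0.167935 && \text{(for } d > 20) \\
&\quad \le 0.998573 \le 1 - \lambda.
\end{aligned}$$

Therefore, $\mathrm{err}_D(w_k) \le 0.038709(1 - \lambda)^k$.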
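The closing numeric bound of this proof can be checked directly. The sketch below is only a sanity check, and it rests on two identifications inferred from the printed values rather than stated in the text: that the constant 5.88133 is $c\sqrt{2\pi}$, and that 0.167935 is $2e^{-c^2(d-2)/(2d)}$ evaluated at $d = 20$. Both terms shrink as $d$ grows, so $d = 20$ is the worst case in the stated range.

```python
import math

c = 2.3463
d = 20
lam = 1e-8
ratio = math.sqrt(2.50306) * (3.6e-6) ** 0.25      # tau_k / b_{k-1}

band_factor = c * math.sqrt(2 * math.pi)           # assumed origin of 5.88133
outside = 2 * math.exp(-c ** 2 * (d - 2) / (2 * d))  # assumed origin of 0.167935
total = band_factor * (2 * ratio + 3.28e-8) * math.sqrt(1 + 1 / d) + outside
```

Under these assumptions `total` comes out just below 1, which matches the printed value 0.998573 to within rounding and leaves the required gap to $1 - \lambda$.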

Sample complexity analysis: We require $m_k$ labeled samples in the band $S_{w_{k-1}, b_{k-1}}$ at round $k$. By
Lemma 4, the probability that a randomly drawn sample from $D$ falls in $S_{w_{k-1}, b_{k-1}}$ is at least $O(b_{k-1}\sqrt{d}) =
O((1 - \lambda)^{k-1})$. Therefore, we need $O\big(\frac{m_k}{(1-\lambda)^{k-1}}\big)$ unlabeled samples to get $m_k$ examples in the band
with probability $1 - \frac{\delta}{2(k+k^2)}$. So, the total unlabeled sample complexity is at most

$$\sum_{k=1}^{s} O\Big(\frac{m_k}{(1-\lambda)^{k-1}}\Big) \le \frac{1}{\epsilon} \sum_{k=1}^{s} m_k \in O\Big(\frac{d}{\epsilon} \log\Big(\frac{1}{\epsilon}\Big)\Big(d + \log\frac{\log(1/\epsilon)}{\delta}\Big)\Big).$$

□


4 Average Does Not Work 

Our algorithm described in the previous section uses convex loss minimization (in our case, hinge loss) in
the band as an efficient proxy for minimizing the 0/1 loss. The Average algorithm introduced by [30] is
another computationally efficient algorithm that has provable noise tolerance guarantees under certain noise
models and distributions. For example, it achieves arbitrarily small excess error in the presence of random
classification noise and monotonic noise when the distribution is uniform over the unit sphere. Furthermore,
even in the presence of a small amount of malicious noise and less symmetric distributions, Average has
been used to obtain a weak learner, which can then be boosted to achieve a non-trivial noise tolerance [27].
Therefore it is natural to ask whether the noise tolerance that Average exhibits could be extended to the
case of Massart noise under the uniform distribution. We answer this question in the negative. We show that
the lack of symmetry in Massart noise presents a significant barrier for the one-shot application of Average,
even when the marginal distribution is completely symmetric. Additionally, we also discuss obstacles in
incorporating Average as a weak learner with the margin-based technique.

In a nutshell, Average takes $m$ sample points and their respective labels, $W = \{(x^1, y^1), \ldots, (x^m, y^m)\}$,
and returns $\frac{1}{m} \sum_{i=1}^{m} x^i y^i$. Our main result in this section shows that for a wide range of distributions that
are very symmetric in nature, including the Gaussian and the uniform distribution, there is an instance of
Massart noise under which Average cannot achieve an arbitrarily small excess error.

Theorem 2. For any continuous distribution $D$ with a p.d.f. that is a function of the distance from the origin
only, there is a noisy distribution $\tilde{D}$ over $X \times \{0, 1\}$ that satisfies the Massart noise condition in Equation 1 for
some parameter $\beta > 0$ and Average returns a classifier with excess error $\Omega(\frac{\beta(1-\beta)}{1+\beta})$.

Proof. Let $w^* = (1, 0, \ldots, 0)$ be the target halfspace. Let the noise distribution be such that for all $x$, if
$x_1 x_2 < 0$ then we flip the label of $x$ with probability $\frac{1-\beta}{2}$, otherwise we keep the label. Clearly, this satisfies
Massart noise with parameter $\beta$. Let $w$ be the expected vector returned by Average. We first show that $w$ is far
from $w^*$ in angle. Then, using Equation 2 we show that $w$ has large excess error.

First we examine the expected component of $w$ that is parallel to $w^*$, i.e., $w \cdot w^* = w_1$. For ease of
exposition, we divide our analysis into two cases, one for regions with no noise (first and third quadrants)
and a second for regions with noise (second and fourth quadrants). Let $E$ be the event that $x_1 x_2 > 0$. By
symmetry, it is easy to see that $\Pr[E] = 1/2$. Then

\[
\mathbb{E}[w \cdot w^*] = \Pr(E)\,\mathbb{E}[w \cdot w^* \mid E] + \Pr(\bar{E})\,\mathbb{E}[w \cdot w^* \mid \bar{E}].
\]

For the first term, for $x \in E$ the label has not changed. So, $\mathbb{E}[w \cdot w^* \mid E] = \mathbb{E}[|x_1| \mid E] = \int_0^1 z f(z)$. For
the second term, the label of each point stays the same with probability $\frac{1+\beta}{2}$ and is flipped with probability
$\frac{1-\beta}{2}$. Hence, $\mathbb{E}[w \cdot w^* \mid \bar{E}] = \beta\,\mathbb{E}[|x_1| \mid \bar{E}] = \beta \int_0^1 z f(z)$. Therefore, the expected parallel component of $w$
is $\mathbb{E}[w \cdot w^*] = \frac{1+\beta}{2} \int_0^1 z f(z)$.

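Next, we examine $w_2$, the orthogonal component of $w$ on the second coordinate. Similar to the previous
case, for the clean regions $\mathbb{E}[w_2 \mid E] = \mathbb{E}[|x_2| \mid E] = \int_0^1 z f(z)$. Next, for the second and fourth quadrants,
which are noisy, we have

\[
\begin{aligned}
\mathbb{E}_{(x,y)\sim\tilde{D}}[x_2 y \mid x_1 x_2 < 0]
&= \frac{1}{2}\left(\frac{1+\beta}{2}\int_{-1}^{0} z f(z) + \frac{1-\beta}{2}\int_{-1}^{0} (-z) f(z)\right) && \text{(Fourth quadrant)}\\
&\quad + \frac{1}{2}\left(\frac{1+\beta}{2}\int_{0}^{1} (-z) f(z) + \frac{1-\beta}{2}\int_{0}^{1} z f(z)\right) && \text{(Second quadrant)}\\
&= -\frac{1+\beta}{2}\int_{0}^{1} z f(z) + \frac{1-\beta}{2}\int_{0}^{1} z f(z) && \text{(By symmetry)}\\
&= -\beta \int_{0}^{1} z f(z).
\end{aligned}
\]

So, $w_2 = \frac{1-\beta}{2}\int_0^1 z f(z)$. Therefore $\theta(w, w^*) = \arctan\left(\frac{1-\beta}{1+\beta}\right) \ge \frac{1-\beta}{(1+\beta)\pi}$. By Equation 2, we have

\[
\mathrm{err}_{\tilde{D}}(w) - \mathrm{err}_{\tilde{D}}(w^*) \ge \beta\,\frac{\theta(w, w^*)}{\pi} \ge \beta\,\frac{1-\beta}{\pi^2(1+\beta)}.
\]

□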
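
The angle computed in the proof can be checked empirically. The following is a small Monte Carlo sketch, not part of the original analysis: it assumes a standard two-dimensional Gaussian marginal (one of the radially symmetric distributions covered by the theorem), labels in -1/+1, and the noise pattern of the proof (flip with probability $(1-\beta)/2$ when $x_1 x_2 < 0$). The function names are hypothetical.

```python
import math
import random


def simulate_average_angle(beta: float, n: int = 200_000, seed: int = 0) -> float:
    """Estimate the angle between the Average vector and w* = (1, 0)."""
    rng = random.Random(seed)
    flip_prob = (1.0 - beta) / 2.0
    s1 = s2 = 0.0
    for _ in range(n):
        x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        y = 1.0 if x1 >= 0.0 else -1.0          # clean label from w* = (1, 0)
        if x1 * x2 < 0.0 and rng.random() < flip_prob:
            y = -y                               # noise only in quadrants 2 and 4
        s1 += y * x1
        s2 += y * x2
    return abs(math.atan2(s2, s1))


def predicted_angle(beta: float) -> float:
    """Angle arctan((1 - beta) / (1 + beta)) derived in the proof."""
    return math.atan((1.0 - beta) / (1.0 + beta))


def excess_error_lower_bound(beta: float) -> float:
    """Lower bound beta * theta / pi on the excess error (Equation 2)."""
    return beta * predicted_angle(beta) / math.pi
```

For a fixed $\beta < 1$ the simulated angle stays bounded away from zero no matter how many samples are drawn, which is the content of the lower bound.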


Our margin-based analysis from Section 3 relies on using hinge-loss minimization in the band at every
round to efficiently find a halfspace $w_k$ that is a weak learner for $D_k$, i.e., $\mathrm{err}_{D_k}(w_k)$ is at most a small
constant, as demonstrated in Lemma 3. Motivated by this more lenient goal of finding a weak learner, one
might ask whether Average, as an efficient algorithm for finding low error halfspaces, can be incorporated
with the margin-based technique in the same way as hinge loss minimization. We argue that the margin-based
technique is inherently incompatible with Average.

The margin-based technique maintains two key properties at every step: First, the angle between $w_k$
and $w_{k-1}$ and the angle between $w_{k-1}$ and $w^*$ are small, and as a result $\theta(w_k, w^*)$ is small. Second, $w_k$
is a weak learner with $\mathrm{err}_{D_{k-1}}(w_k)$ at most a small constant. In our work, hinge loss minimization in the
band guarantees both of these properties simultaneously by limiting its search to the halfspaces that are
close in angle to $w_{k-1}$ and limiting its distribution to $D_{w_{k-1}, b_{k-1}}$. However, in the case of Average, as we
concentrate in the band $D_{w_{k-1}, b_{k-1}}$ we bias the distribution towards its orthogonal component with respect
to $w_{k-1}$. Hence, an upper bound on $\theta(w^*, w_{k-1})$ only serves to assure that most of the data is orthogonal to
$w^*$ as well. Therefore, informally speaking, we lose the signal that otherwise could direct us in the direction
of $w^*$. More formally, consider the construction from Theorem 2 such that $w_{k-1} = w^* = (1, 0, \ldots, 0)$. In
distribution $D_{w_{k-1}, b_{k-1}}$, the component of $w_k$ that is parallel to $w_{k-1}$ scales down by the width of the band,
$b_{k-1}$. However, as most of the probability stays in a band passing through the origin in any log-concave
(including Gaussian and uniform) distribution, the orthogonal component of $w_k$ remains almost unchanged.
Therefore, $\theta(w_k, w^*) = \theta(w_k, w_{k-1}) \in$

5 Hinge Loss Minimization Does Not Work


Hinge loss minimization is a widely used technique in machine learning. In this section, we show that,
perhaps surprisingly, hinge loss minimization does not lead to arbitrarily small excess error even under a very
small noise condition; that is, it is not consistent. (Note that in our setting of Massart noise, consistency is
the same as achieving arbitrarily small excess error, since the Bayes optimal classifier is a member of the
class of halfspaces.)

It has been shown in earlier work that hinge loss minimization can lead to classifiers of large 0/1-loss.
However, the lower bounds in that work employ distributions with significant mass on discrete points with
flipped label (which is not possible under Massart noise) at a very large distance from the optimal classifier.
Thus, that result makes strong use of the hinge loss's sensitivity to errors at large distance. Here, we show
that hinge loss minimization is bound to fail under much more benign conditions. More concretely, we show
that for every parameter $\tau$, and arbitrarily small bound on the probability of flipping a label, $\eta = \frac{1-\beta}{2}$, hinge
loss minimization is not consistent even on distributions with a uniform marginal over the unit ball in $\mathbb{R}^2$,
with the Bayes optimal classifier being a halfspace and the noise satisfying the Massart noise condition with
bound $\eta$. That is, there exists a constant $\epsilon > 0$ and a sample size $m(\epsilon)$ such that hinge loss minimization
returns a classifier of excess error at least $\epsilon$ with high probability over samples of size at least $m(\epsilon)$.

Hinge loss minimization does approximate the optimal hinge loss. We show that this does not translate
into an agnostic learning guarantee for halfspaces with respect to the 0/1-loss even under very small noise
conditions. Let $\mathcal{P}_\beta$ be the class of distributions $D$ with uniform marginal over the unit ball $B_1 \subseteq \mathbb{R}^2$, the
Bayes classifier being a halfspace $w$, and satisfying the Massart noise condition with parameter $\beta$. Our
lower bound for hinge loss minimization is stated as follows.

Theorem 3. For every hinge-loss parameter $\tau > 0$ and every Massart noise parameter $0 < \beta < 1$, there
exists a distribution $D_{\tau,\beta} \in \mathcal{P}_\beta$ (that is, a distribution over $B_1 \times \{-1, 1\}$ with uniform marginal over
$B_1 \subseteq \mathbb{R}^2$ satisfying the $\beta$-Massart condition) such that $\tau$-hinge loss minimization is not consistent on $D_{\tau,\beta}$
with respect to the class of halfspaces. That is, there exists an $\epsilon > 0$ and a sample size $m(\epsilon)$ such that hinge
loss minimization will output a classifier of excess error larger than $\epsilon$ (with high probability over samples of size
at least $m(\epsilon)$).


Proof idea. To prove the above result, we define a subclass $\mathcal{P}_{\alpha,\eta} \subseteq \mathcal{P}_\beta$ consisting of well-structured
distributions. We then show that for every hinge parameter $\tau$ and every bound on the noise $\eta$, there is a
distribution $D \in \mathcal{P}_{\alpha,\eta}$ on which $\tau$-hinge loss minimization is not consistent.

In the remainder of this section, we use the notation $h_w$ for the classifier associated with a vector
$w \in B_1$, that is $h_w(x) = \mathrm{sign}(w \cdot x)$, since for our geometric construction it is convenient to
differentiate between the two. We define a family $\mathcal{P}_{\alpha,\eta} \subseteq \mathcal{P}_\beta$ of distributions $D_{\alpha,\eta}$, indexed by an
angle $\alpha$ and a noise parameter $\eta$, as follows. Let the Bayes optimal classifier be linear, $h^* = h_{w^*}$ for a unit vector $w^*$.

Let $h_w$ be the classifier that is defined by the unit vector $w$ at angle $\alpha$ from $w^*$. We
partition the unit ball into areas $A$, $B$ and $D$ as in Figure 1. That is, $A$ consists of
the two wedges of disagreement between $h_w$ and $h_{w^*}$, and the wedge where the two
classifiers agree is divided into $B$ (points that are closer to $h_w$ than to $h_{w^*}$) and $D$
(points that are closer to $h_{w^*}$ than to $h_w$). We now flip the labels of all points in $A$ and $B$ with probability
$\eta = \frac{1-\beta}{2}$ and leave the labels deterministic according to $h_{w^*}$ in the area $D$.

More formally, points at angle between $\alpha/2$ and $\pi/2$ and points at angle between $\pi + \alpha/2$ and $-\pi/2$
from $w^*$ are labeled per $h_{w^*}(x)$ with conditional label probability 1. All other points are labeled $-h_{w^*}(x)$
with probability $\eta$ and $h_{w^*}(x)$ with probability $(1 - \eta)$. Clearly, this distribution satisfies the Massart noise
condition in Equation 1 with parameter $\beta$.

Figure 1: $D_{\alpha,\eta}$

The goal of the above construction is to design distributions where vectors along the direction of $w$
have smaller hinge loss than those along the direction of $w^*$. Observe that the noise in the area $A$ will tend to
"even out" the difference in hinge loss between $w$ and $w^*$ (since area $A$ is symmetric with respect to these
two directions). The noise in area $B$, however, will "help $w$": since all points in area $B$ are closer to the
hyperplane defined by $w$ than to the one defined by $w^*$, vector $w^*$ will pay more in hinge loss for the noise
in this area. In the corresponding area $D$ of points that are closer to the hyperplane defined by $w^*$ than to
the one defined by $w$ we do not add noise, so the cost for both $w$ and $w^*$ in this area is small.

We show that for every $\alpha$, from a certain noise level $\eta$ on, $w^*$ (or any other vector in its direction) is
not the expected hinge minimizer on $D_{\alpha,\eta}$. We then argue that hinge loss minimization will therefore not
approximate $w^*$ arbitrarily closely in angle and cannot achieve arbitrarily small excess 0/1-error.
Overall, we show that for every (arbitrarily small) bound on the noise $\eta_0$ and hinge parameter $\tau_0$, we can
choose an angle $\alpha$ such that $\tau_0$-hinge loss minimization is not consistent for distribution $D_{\alpha,\eta_0}$. The details
of the proof can be found in the Appendix, Section D.

6 Conclusions 

Our work is the first to provide a computationally efficient algorithm under the Massart noise model, a 
distributional assumption that has been identified in statistical learning to yield fast (statistical) rates of 
convergence. While both computational and statistical efficiency are crucial in machine learning applications,
computational and statistical complexity have been studied under disparate sets of assumptions and models. 
We view our results on the computational complexity of learning under Massart noise also as a step towards 
bringing these two lines of research closer together. We hope that this will spur more work identifying 
situations that lead to both computational and statistical efficiency to ultimately shed light on the underlying 
connections and dependencies of these two important aspects of automated learning. 
A Probability Lemmas For The Uniform Distribution

The following probability lemmas are used throughout this work. Variations of these lemmas are presented
in previous work in terms of their asymptotic behavior. Here, we focus on finding bounds that are
tight even when the constants are concerned. Indeed, the improved constants in these bounds are essential
to tolerating Massart noise with $\beta > 1 - 3.6 \times 10^{-6}$.

Throughout this section, let $D$ be the uniform distribution over a $d$-dimensional ball. Let $f(\cdot)$ indicate
the p.d.f. of $D$. For any $d$, let $V_d$ be the volume of a $d$-dimensional unit ball. Ratios between volumes of the
unit ball in different dimensions are commonly used to find the probability mass of different regions under
the uniform distribution. Note that for any $d$

\[
\frac{V_{d-2}}{V_d} = \frac{d}{2\pi}.
\]

The following bound due to [8] proves useful in our analysis.



The next lemma provides an upper and lower bound for the probability mass of a band in the uniform distribution.

Lemma 4. Let $u$ be any unit vector in $\mathbb{R}^d$. For all $a, b \in \left[-\sqrt{C/d}, \sqrt{C/d}\right]$, such that $C < d/2$, we have

\[
|b - a|\, 2^{-C}\, \frac{V_{d-1}}{V_d} \le \Pr_{x \sim D}\left[u \cdot x \in [a, b]\right] \le |b - a|\, \frac{V_{d-1}}{V_d}.
\]

Proof. We have

\[
\Pr_{x \sim D}\left[u \cdot x \in [a, b]\right] = \frac{V_{d-1}}{V_d} \int_a^b (1 - z^2)^{(d-1)/2}\, dz.
\]

For the upper bound, we note that the integrand is at most 1, so $\Pr_{x \sim D}[u \cdot x \in [a, b]] \le \frac{V_{d-1}}{V_d} |b - a|$. For
the lower bound, note that since $a, b \in \left[-\sqrt{C/d}, \sqrt{C/d}\right]$, the integrand is at least $\left(1 - \frac{C}{d}\right)^{(d-1)/2}$. We know that
for any $x \in [0, 0.5]$, $1 - x \ge 4^{-x}$. So, assuming that $d > 2C$, $\left(1 - \frac{C}{d}\right)^{(d-1)/2} \ge 4^{-\frac{C}{d}(d-1)/2} \ge 2^{-C}$, and
$\Pr_{x \sim D}[u \cdot x \in [a, b]] \ge |b - a|\, 2^{-C}\, \frac{V_{d-1}}{V_d}$. □
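
The band bound can be sanity-checked numerically. The sketch below is an illustration only: it assumes the form of the bound in terms of the ratio $V_{d-1}/V_d$ that the proof produces, uses the standard formula $V_d = \pi^{d/2} / \Gamma(d/2 + 1)$, and evaluates the integral with a simple midpoint rule. The helper names are hypothetical.

```python
import math
from typing import Tuple


def unit_ball_volume(d: int) -> float:
    """Volume of the d-dimensional unit ball, V_d = pi^(d/2) / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2.0) / math.gamma(d / 2.0 + 1.0)


def band_probability(d: int, a: float, b: float, steps: int = 20_000) -> float:
    """Midpoint-rule estimate of (V_{d-1}/V_d) * int_a^b (1 - z^2)^((d-1)/2) dz."""
    ratio = unit_ball_volume(d - 1) / unit_ball_volume(d)
    h = (b - a) / steps
    total = 0.0
    for i in range(steps):
        z = a + (i + 0.5) * h
        total += (1.0 - z * z) ** ((d - 1) / 2.0)
    return ratio * total * h


def band_bounds(d: int, a: float, b: float, C: float) -> Tuple[float, float]:
    """Lower and upper bounds on the band probability, as in the proof above."""
    ratio = unit_ball_volume(d - 1) / unit_ball_volume(d)
    width = abs(b - a)
    return width * 2.0 ** (-C) * ratio, width * ratio
```

For example, with $d = 50$ and $C = 1$ the band $[-1/\sqrt{d}, 1/\sqrt{d}]$ has a probability that falls between the two bounds, and the same helper confirms the volume-ratio identity $V_{d-2}/V_d = d/(2\pi)$.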

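Lemma 5. Let $u$ and $w$ be two unit vectors in $\mathbb{R}^d$ and let $\alpha = \theta(u, w)$. Then,

\[
\Pr_{x \sim D}\left[\mathrm{sign}(u \cdot x) \ne \mathrm{sign}(w \cdot x) \text{ and } |u \cdot x| > \frac{c\alpha}{\sqrt{d}}\right] \le \frac{\alpha}{\pi}\, e^{-\frac{c^2 (d-2)}{2d}}.
\]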

Proof. Without loss of generality, we can assume $u = (1, 0, \ldots, 0)$ and $w = (\cos(\alpha), \sin(\alpha), 0, \ldots, 0)$.
Consider the projection of $D$ on the first 2 coordinates. Let $E$ be the event we are interested in. We first
show that for any $x = (x_1, x_2) \in E$, $\|x\|_2 > c/\sqrt{d}$. Consider $x_1 \ge 0$ (the other case is symmetric). If
$x \in E$, it must be that $\|x\|_2 \sin(\alpha) \ge \frac{c\alpha}{\sqrt{d}}$. So, $\|x\|_2 \ge \frac{c\alpha}{\sin(\alpha)\sqrt{d}} \ge \frac{c}{\sqrt{d}}$.

Next, we consider a circle of radius $\frac{c}{\sqrt{d}} < r < 1$ around the center, indicated by $S(r)$. Let $A(r) =
S(r) \cap E$ be the arc of such a circle that is in $E$. Then the length of such an arc is the arc-length that falls in the
disagreement region, i.e., $r\alpha$, minus the arc-length that falls in the band of width $\frac{c\alpha}{\sqrt{d}}$. Note that for every
$x \in A(r)$, $\|x\|_2 = r$, so $f(x) = \frac{V_{d-2}}{V_d}(1 - \|x\|^2)^{(d-2)/2} = \frac{V_{d-2}}{V_d}(1 - r^2)^{(d-2)/2}$.


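\[
\begin{aligned}
\Pr_{x \sim D}\left[\mathrm{sign}(u \cdot x) \ne \mathrm{sign}(w \cdot x) \text{ and } |u \cdot x| > \frac{c\alpha}{\sqrt{d}}\right]
&= 2\int_{c/\sqrt{d}}^{1} \left(r\alpha - \frac{c\alpha}{\sqrt{d}}\right) f(r)\, dr \\
&= 2\int_{1}^{\sqrt{d}/c} \left(\frac{c}{\sqrt{d}} z\alpha - \frac{c\alpha}{\sqrt{d}}\right) f\!\left(\frac{c z}{\sqrt{d}}\right) \frac{c}{\sqrt{d}}\, dz \quad \text{(change of variable } z = r\sqrt{d}/c\text{)} \\
&= 2\,\frac{V_{d-2}}{V_d}\,\frac{c^2\alpha}{d} \int_{1}^{\sqrt{d}/c} (z - 1)\left(1 - \frac{c^2 z^2}{d}\right)^{(d-2)/2} dz \\
&\le \frac{c^2\alpha}{\pi} \int_{1}^{\sqrt{d}/c} (z - 1)\, e^{-\frac{c^2 z^2 (d-2)}{2d}}\, dz \\
&\le \frac{\alpha}{\pi} \int_{1}^{\sqrt{d}/c} \frac{(d-2) c^2 z}{d}\, e^{-\frac{(d-2) c^2 z^2}{2d}}\, dz \\
&= -\frac{\alpha}{\pi} \left[ e^{-\frac{(d-2) c^2 z^2}{2d}} \right]_{z=1}^{z=\sqrt{d}/c} \\
&= \frac{\alpha}{\pi}\left( e^{-\frac{c^2 (d-2)}{2d}} - e^{-(d-2)/2} \right) \\
&\le \frac{\alpha}{\pi}\, e^{-\frac{c^2 (d-2)}{2d}}.
\end{aligned}
\]

□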
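
As a numerical illustration of the last steps of this proof, the sketch below compares the integral $\frac{c^2\alpha}{\pi}\int_1^{\sqrt{d}/c} (z-1)\, e^{-c^2 z^2 (d-2)/(2d)}\, dz$ with the final bound $\frac{\alpha}{\pi} e^{-c^2(d-2)/(2d)}$ for particular values of $c$ and $d$. It is a spot check under the assumption that this is the intermediate expression in the chain, not a proof; the function names and the midpoint-rule integration are choices made here.

```python
import math


def lemma5_integral(c: float, d: int, alpha: float, steps: int = 20_000) -> float:
    """Midpoint-rule value of (c^2 alpha / pi) * int_1^{sqrt(d)/c} (z-1) exp(-a z^2 / 2) dz,
    where a = c^2 (d - 2) / d. Requires c < sqrt(d) so that the range is non-empty."""
    upper = math.sqrt(d) / c
    a = c * c * (d - 2) / d
    h = (upper - 1.0) / steps
    total = 0.0
    for i in range(steps):
        z = 1.0 + (i + 0.5) * h
        total += (z - 1.0) * math.exp(-a * z * z / 2.0)
    return (c * c * alpha / math.pi) * total * h


def lemma5_bound(c: float, d: int, alpha: float) -> float:
    """Final bound (alpha / pi) * exp(-c^2 (d - 2) / (2 d)) from the lemma."""
    return (alpha / math.pi) * math.exp(-c * c * (d - 2) / (2.0 * d))
```

For the sampled settings ($d = 100$ with $c = 1$ or $c = 2$) the integral sits well below the stated bound.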


B Proofs of Margin-based Lemmas 


Proof of Lemma 1. Let $L(w^*) = \mathbb{E}_{(x,y)\sim D_k}\, \ell(w^*, x, y)$, $\tau = \tau_k$, and $b = b_{k-1}$. First note that for our choice
of $b \le 2.3463 \times 0.121608\, \frac{1}{\sqrt{d}}$, using Lemma 4 we have that

\[
\Pr_{x \sim D}\left[|w_{k-1} \cdot x| < b\right] \ge 2b \times 2^{-0.285329}\, \frac{V_{d-1}}{V_d}.
\]

Note that $L(w^*)$ is maximized when $w^* = w_{k-1}$. Then

\[
L(w^*) \le \frac{2\int_0^{\tau} \left(1 - \frac{a}{\tau}\right) f(a)\, da}{\Pr_{x \sim D}\left[|w_{k-1} \cdot x| < b\right]} \le \frac{\int_0^{\tau} \left(1 - \frac{a}{\tau}\right)(1 - a^2)^{(d-1)/2}\, da}{b\, 2^{-0.285329}}.
\]

For the numerator:

\[
\begin{aligned}
\int_0^{\tau} \left(1 - \frac{a}{\tau}\right)(1 - a^2)^{(d-1)/2}\, da
&\le \int_0^{\tau} \left(1 - \frac{a}{\tau}\right) e^{-a^2 (d-1)/2}\, da \\
&\le \int_0^{\tau} e^{-a^2 (d-1)/2}\, da - \frac{1}{\tau}\int_0^{\tau} a\, e^{-a^2 (d-1)/2}\, da \\
&\le \sqrt{\frac{\pi}{2(d-1)}}\; \mathrm{erf}\!\left(\tau\sqrt{\frac{d-1}{2}}\right) - \frac{1}{(d-1)\tau}\left(1 - e^{-(d-1)\tau^2/2}\right) \\
&\le \tau\left(0.5462 + \frac{1}{8}(d-1)\tau^2\right) && \text{(By Taylor expansion)} \\
&\le 0.5463\,\tau,
\end{aligned}
\]

where the last inequality follows from the fact that, for our choice of parameters, $\tau$ is small enough that
$\frac{1}{8}(d-1)\tau^2 < 10^{-5}$. Therefore,

\[
L(w^*) \le 0.5463 \times 2^{0.285329}\, \frac{\tau}{b} \le 0.665769\, \frac{\tau}{b}.
\]

□


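Proof of Lemma 3. Note that the convex loss minimization procedure returns a vector $v_k$ that is not
necessarily normalized. To consider all vectors in $B(w_{k-1}, \alpha_k)$, at step $k$, the optimization is done over all
vectors $v$ (of any length) such that $\|w_{k-1} - v\| < \alpha_k$. For all $k$, $\alpha_k < 0.0387097\pi$ (or $0.121608$), so
$\|v_k\|_2 \ge 1 - 0.121608$, and as a result $\ell(w_k, W) \le 1.13844\, \ell(v_k, W)$. We have,

\[
\begin{aligned}
\mathrm{err}_{D_k}(w_k) &\le \mathbb{E}_{(x,y)\sim D_k}\, \ell(w_k, x, y) \\
&\le \mathbb{E}_{(x,y)\sim \tilde{D}_k}\, \ell(w_k, x, y) + 1.092\sqrt{2}\sqrt{1-\beta}\, \frac{b_{k-1}}{\tau_k} && \text{(By Lemma 2)} \\
&\le \ell(w_k, W) + 1.092\sqrt{2}\sqrt{1-\beta}\, \frac{b_{k-1}}{\tau_k} + 10^{-8} && \text{(By Equation 3)} \\
&\le 1.13844\, \ell(v_k, W) + 1.092\sqrt{2}\sqrt{1-\beta}\, \frac{b_{k-1}}{\tau_k} + 10^{-8} && \text{(By } \|v_k\|_2 \ge 1 - 0.121608\text{)} \\
&\le 1.13844\, \ell(w^*, W) + 1.092\sqrt{2}\sqrt{1-\beta}\, \frac{b_{k-1}}{\tau_k} + 2.14 \times 10^{-8} && \text{(By } v_k \text{ minimizing the hinge loss)} \\
&\le 1.13844\, \mathbb{E}_{(x,y)\sim \tilde{D}_k}\, \ell(w^*, x, y) + 1.092\sqrt{2}\sqrt{1-\beta}\, \frac{b_{k-1}}{\tau_k} + 3.28 \times 10^{-8} && \text{(By Equation 3)} \\
&\le 1.13844\, \mathbb{E}_{(x,y)\sim D_k}\, \ell(w^*, x, y) + 2.13844 \cdot 1.092\sqrt{2}\sqrt{1-\beta}\, \frac{b_{k-1}}{\tau_k} + 3.28 \times 10^{-8} && \text{(By Lemma 2)} \\
&\le 0.757941\, \frac{\tau_k}{b_{k-1}} + 3.303\sqrt{1-\beta}\, \frac{b_{k-1}}{\tau_k} + 3.28 \times 10^{-8} && \text{(By Lemma 1)}
\end{aligned}
\]

□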
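
The numerical constants in the proofs of Lemma 1 and Lemma 3 can be cross-checked with a few lines of arithmetic. This sketch assumes the bound on $\alpha_k$ is $0.121608$ (the value for which $0.0387097\pi$, the factor $1.13844$, and the product $2.3463 \times 0.121608 = 0.285329$ all agree); the variable names are hypothetical.

```python
import math

# Assumed reading of the bound on alpha_k; note 0.0387097 * pi is about 0.12161.
ALPHA_CAP = 0.121608

# Normalizing v_k (norm at least 1 - ALPHA_CAP) scales the hinge loss by at most this factor.
NORM_FACTOR = 1.0 / (1.0 - ALPHA_CAP)

# Constant in Lemma 1: 0.5463 * 2^0.285329.
LEMMA1_CONST = 0.5463 * 2.0 ** 0.285329

# Coefficient of tau_k / b_{k-1} in the last line of the proof of Lemma 3.
HINGE_TERM = NORM_FACTOR * 0.665769

# Coefficient of sqrt(1 - beta) * b_{k-1} / tau_k in the last line.
NOISE_TERM = (1.0 + 1.13844) * 1.092 * math.sqrt(2.0)
```

With these values, `NORM_FACTOR` rounds to 1.13844, `LEMMA1_CONST` to 0.665769, `HINGE_TERM` to 0.757941, and `NOISE_TERM` stays just below 3.303, matching the constants used in the chain of inequalities.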


Lemma 6. For any constant $c'$, there is $m_k \in O(d(d + \log(k/\delta)))$ such that for a randomly drawn set $W$
of $m_k$ labeled samples from $\tilde{D}_k$, with probability $1 - \frac{\delta}{k + k^2}$, for any $w \in B(w_{k-1}, \alpha_k)$,

\[
\left| \mathbb{E}_{(x,y)\sim \tilde{D}_k}\left( \ell(w, x, y) - \ell(w, W) \right) \right| \le c',
\]

\[
\left| \mathbb{E}_{(x,y)\sim D_k}\left( \ell(w, x, y) - \ell(w, \mathrm{cleaned}(W)) \right) \right| \le c'.
\]

Proof. By Lemma H.3 of [3], $\ell(w, x, y) = O(\sqrt{d})$ for all $(x, y) \in S_{w_{k-1}, b_{k-1}}$ and $\theta(w, w_{k-1}) \le r_k$. We
get the result by applying Lemma H.2 of [3]. □

C Initialization 

We initialize our margin-based procedure with the algorithm from [27]. The guarantees mentioned in [27]
hold as long as the noise rate $\eta$ is below a threshold governed by a constant $c$. The authors of [27] do not explicitly compute the constant, but it is easy to
check that $c \le \frac{1}{256}$. This can be computed from inequality 17 in the proof of Lemma 16 in [27]. We need
the l.h.s. to be at least e 2 /2. On the r.h.s., the first term is lower bounded by e 2 /512. Hence, we need the 
second term to be at most ||§e 2 . The second term is upper bounded by 4c 2 e 2 . This implies that c < 1/256. 

D Hinge Loss Minimization 

In this section, we show that hinge loss minimization is not consistent in our setup, that is, that it does not
lead to arbitrarily small excess error. We let $B_1^d$ denote the unit ball in $\mathbb{R}^d$. In this section, we will only work
with $d = 2$, thus we set $B_1 = B_1^2$.

Recall that the $\tau$-hinge loss of a vector $w \in \mathbb{R}^d$ on an example $(x, y) \in \mathbb{R}^d \times \{-1, 1\}$ is defined as
follows:

\[
\ell_\tau(w, x, y) = \max\left(0,\; 1 - \frac{y (w \cdot x)}{\tau}\right).
\]

For a distribution $D$ over $\mathbb{R}^d \times \{-1, 1\}$, we let $\mathcal{L}_\tau^D$ denote the expected hinge loss over $D$, that is

\[
\mathcal{L}_\tau^D(w) = \mathbb{E}_{(x,y)\sim D}\, \ell_\tau(w, x, y).
\]

If clear from context, we omit the superscript and write $\mathcal{L}_\tau(w)$ for $\mathcal{L}_\tau^D(w)$.
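
The definitions above translate directly into code. The following is a minimal sketch of the $\tau$-hinge loss and its empirical average over a sample; the function names are hypothetical and vectors are represented as plain Python sequences.

```python
from typing import Sequence, Tuple


def hinge_loss(w: Sequence[float], x: Sequence[float], y: int, tau: float) -> float:
    """tau-hinge loss max(0, 1 - y (w . x) / tau) for a label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return max(0.0, 1.0 - margin / tau)


def empirical_hinge_loss(
    w: Sequence[float],
    samples: Sequence[Tuple[Sequence[float], int]],
    tau: float,
) -> float:
    """Average tau-hinge loss of w over a non-empty labeled sample."""
    return sum(hinge_loss(w, x, y, tau) for x, y in samples) / len(samples)
```

Smaller values of `tau` penalize correctly classified points near the boundary more heavily, while misclassified points always incur a loss above 1.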

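Let $\mathcal{A}_\tau$ be the algorithm that minimizes the empirical $\tau$-hinge loss over a sample. That is, for $W =
\{(x_1, y_1), \ldots, (x_m, y_m)\}$, we have

\[
\mathcal{A}_\tau(W) \in \operatorname*{argmin}_{w \in B_1} \frac{1}{|W|} \sum_{(x,y) \in W} \ell_\tau(w, x, y).
\]

Hinge loss minimization over halfspaces converges to the optimal hinge loss over all halfspaces (it is
"hinge loss consistent"). That is, for all $\epsilon > 0$ there is a sample size $m(\epsilon)$ such that for all distributions $D$,
we have

\[
\mathbb{E}_{W \sim D^m}\left[\mathcal{L}_\tau^D(\mathcal{A}_\tau(W))\right] \le \min_{w \in B_1} \mathcal{L}_\tau^D(w) + \epsilon.
\]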

In this section, we show that this does not translate into an agnostic learning guarantee for halfspaces
with respect to the 0/1-loss. Moreover, hinge loss minimization is not even consistent with respect to
the 0/1-loss when restricted to a rather benign class of distributions. Let $\mathcal{P}_\beta$ be the class of
distributions $D$ with uniform marginal over the unit ball in $\mathbb{R}^2$, the Bayes classifier being a halfspace $w$, and
satisfying the Massart noise condition with parameter $\beta$. We show that there is a distribution $D \in \mathcal{P}_\beta$ and an
$\epsilon > 0$ and a sample size $m_0$ such that hinge loss minimization will output a classifier of excess error larger
than $\epsilon$ in expectation over samples of size larger than $m_0$. More precisely, for all $m \ge m_0$:

\[
\mathbb{E}_{W \sim D^m}\left[\mathrm{err}_D(\mathcal{A}_\tau(W))\right] > \min_{w \in B_1} \mathrm{err}_D(w) + \epsilon.
\]

Formally, our lower bound for hinge loss minimization is stated as follows. 

Theorem 3 (Restated). For every hinge-loss parameter $\tau > 0$ and every Massart noise parameter $0 <
\beta < 1$, there exists a distribution $D_{\tau,\beta} \in \mathcal{P}_\beta$ (that is, a distribution over $B_1 \times \{-1, 1\}$ with uniform marginal
over $B_1 \subseteq \mathbb{R}^2$ satisfying the $\beta$-Massart condition) such that $\tau$-hinge loss minimization is not consistent on
$D_{\tau,\beta}$ with respect to the class of halfspaces. That is, there exists an $\epsilon > 0$ and a sample size $m(\epsilon)$ such
that hinge loss minimization will output a classifier of excess error larger than $\epsilon$ (with high probability over
samples of size at least $m(\epsilon)$).

In this section, we use the notation $h_w$ for the classifier associated with a vector $w \in B_1$, that is $h_w(x) =
\mathrm{sign}(w \cdot x)$, since for our geometric construction it is convenient to differentiate between the two. The rest
of this section is devoted to proving the above theorem.

A class of distributions

Let $\eta = \frac{1-\beta}{2}$. We define a family $\mathcal{P}_{\alpha,\eta} \subseteq \mathcal{P}_\beta$ of distributions $D_{\alpha,\eta}$, indexed by an
angle $\alpha$ and a noise parameter $\eta$, as follows. We let the marginal be uniform over the
unit ball $B_1 \subseteq \mathbb{R}^2$ and let the Bayes optimal classifier be linear, $h^* = h_{w^*}$ for a unit
vector $w^*$. Let $h_w$ be the classifier that is defined by the unit vector $w$ at angle $\alpha$
from $w^*$. We partition the unit ball into areas $A$, $B$ and $D$ as in Figure 2. That is,
$A$ consists of the two wedges of disagreement between $h_w$ and $h_{w^*}$, and the wedge
where the two classifiers agree is divided into $B$ (points that are closer to $h_w$ than to
$h_{w^*}$) and $D$ (points that are closer to $h_{w^*}$ than to $h_w$). We now "add noise $\eta$" at all
points in areas $A$ and $B$ and leave the labels deterministic according to $h_{w^*}$ in the
area $D$.

More formally, points at angle between $\alpha/2$ and $\pi/2$ and points at angle between $\pi + \alpha/2$ and $-\pi/2$
from $w^*$ are labeled with $h_{w^*}(x)$ with (conditional) probability 1. All other points are labeled $-h_{w^*}(x)$
with probability $\eta$ and $h_{w^*}(x)$ with probability $(1 - \eta)$.

Figure 2: $D_{\alpha,\eta}$
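
The construction of $D_{\alpha,\eta}$ can be made concrete with a small sampler. This is a sketch under assumptions that are not fixed by the text: it places $w^* = (1, 0)$ and $w = (\cos\alpha, \sin\alpha)$, measures the angle $\psi$ of a point from $w^*$ in the direction of $w$, and treats the noise-free area $D$ as $\psi \in [\alpha/2, \pi/2] \cup [\pi + \alpha/2, 3\pi/2]$. The function names are hypothetical.

```python
import math
import random
from typing import Tuple


def in_noise_free_region(psi: float, alpha: float) -> bool:
    """True if the angle psi (measured from w*) lies in area D, where labels are clean."""
    psi = psi % (2.0 * math.pi)
    first = alpha / 2.0 <= psi <= math.pi / 2.0
    second = math.pi + alpha / 2.0 <= psi <= 1.5 * math.pi
    return first or second


def sample_point(alpha: float, eta: float, rng: random.Random) -> Tuple[Tuple[float, float], int]:
    """Draw (x, y): x uniform in the unit disc, y = h_{w*}(x) flipped w.p. eta outside D."""
    r = math.sqrt(rng.random())             # sqrt makes the point uniform over the disc
    psi = 2.0 * math.pi * rng.random()
    x = (r * math.cos(psi), r * math.sin(psi))
    clean = 1 if x[0] >= 0.0 else -1        # label given by w* = (1, 0)
    noisy = (not in_noise_free_region(psi, alpha)) and rng.random() < eta
    return x, (-clean if noisy else clean)
```

Under these conventions the noisy areas $A$ and $B$ together cover a fraction $(\pi + \alpha)/(2\pi)$ of the disc, so with `eta = 1.0` roughly that fraction of sampled labels is flipped.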


Useful lemmas 

The following lemma relates the $\tau$-hinge loss of unit length vectors to the hinge loss of arbitrary vectors in
the unit ball. It will allow us to focus our attention on comparing the $\tau$-hinge loss of unit vectors for $\tau \ge \tau_0$,
instead of having to argue about the $\tau_0$-hinge loss of vectors of arbitrary norms in $B_1$.

Lemma 7. Let $\tau > 0$ and $0 < \lambda \le 1$. Let $w$ and $w^*$ be two vectors of unit length. Then $\mathcal{L}_\tau(\lambda w) < \mathcal{L}_\tau(\lambda w^*)$
if and only if $\mathcal{L}_{\tau/\lambda}(w) < \mathcal{L}_{\tau/\lambda}(w^*)$.

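Proof. By the definition of the hinge loss, we have

\[
\ell_\tau(\lambda w, x, y) = \max\left(0,\; 1 - \frac{y(\lambda w \cdot x)}{\tau}\right) = \max\left(0,\; 1 - \frac{y(w \cdot x)}{\tau/\lambda}\right) = \ell_{\tau/\lambda}(w, x, y).
\]

□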
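
The scaling identity behind Lemma 7 is easy to confirm numerically. The snippet below is an illustrative check on individual examples, with a hypothetical helper `scaling_gap` that returns the absolute difference between the two sides of the identity.

```python
from typing import Sequence


def hinge_loss(w: Sequence[float], x: Sequence[float], y: int, tau: float) -> float:
    """tau-hinge loss max(0, 1 - y (w . x) / tau)."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return max(0.0, 1.0 - margin / tau)


def scaling_gap(w: Sequence[float], x: Sequence[float], y: int, tau: float, lam: float) -> float:
    """|l_tau(lam * w, x, y) - l_{tau/lam}(w, x, y)|, which should vanish for 0 < lam <= 1."""
    scaled_w = [lam * wi for wi in w]
    return abs(hinge_loss(scaled_w, x, y, tau) - hinge_loss(w, x, y, tau / lam))
```

Since the identity holds pointwise, it also holds for the expected losses, which is all the lemma needs.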


Lemma 8. Let $\tau > 0$. For any $D \in \mathcal{P}_{\alpha,\eta}$, let $w_\tau$ denote the halfspace that minimizes the $\tau$-hinge loss with
respect to $D$. If $\theta(w^*, w_\tau) > 0$, then hinge loss minimization is not consistent for the 0/1-loss.

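Proof. First we show that the hinge loss minimizer is never the vector 0. Note that $\mathcal{L}_\tau^D(0) = 1$ (for all
$\tau > 0$). Consider the case $\tau \ge 1$; we show that $w^*$ has $\tau$-hinge loss strictly smaller than 1. Integrating the
hinge loss over the unit ball using polar coordinates, we get

\[
\begin{aligned}
\mathcal{L}_\tau^D(w^*) &\le \frac{2}{\pi}\left( (1-\eta)\int_0^1\!\!\int_0^{\pi} \left(1 - \frac{z}{\tau}\sin(\varphi)\right) z\, d\varphi\, dz + \eta \int_0^1\!\!\int_0^{\pi} \left(1 + \frac{z}{\tau}\sin(\varphi)\right) z\, d\varphi\, dz \right) \\
&= 1 - \frac{2}{\pi}(1-\eta)\int_0^1\!\!\int_0^{\pi} \frac{z^2}{\tau}\sin(\varphi)\, d\varphi\, dz + \frac{2}{\pi}\,\eta \int_0^1\!\!\int_0^{\pi} \frac{z^2}{\tau}\sin(\varphi)\, d\varphi\, dz \\
&= 1 - \frac{2}{\pi}(1-2\eta)\int_0^1\!\!\int_0^{\pi} \frac{z^2}{\tau}\sin(\varphi)\, d\varphi\, dz < 1.
\end{aligned}
\]

For the case of $\tau < 1$, we have

\[
\mathcal{L}_\tau^D(\tau w^*) = \mathcal{L}_1^D(w^*) < 1.
\]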
Thus, $(0, 0)$ is not the hinge-minimizer. Then, by the assumption of the lemma, $w_\tau$ has some positive angle
$\gamma$ to $w^*$. Furthermore, for all $0 \le \lambda \le 1$, $\mathcal{L}_\tau^D(w_\tau) < \mathcal{L}_\tau^D(\lambda w^*)$. Since $w \mapsto \mathcal{L}_\tau^D(w)$ is a continuous
function, we can choose an $\epsilon > 0$ such that

\[
\mathcal{L}_\tau^D(w_\tau) + \epsilon/2 < \mathcal{L}_\tau^D(\lambda w^*) - \epsilon/2
\]

for all $0 \le \lambda \le 1$ (note that the set $\{\lambda w^* \mid 0 \le \lambda \le 1\}$ is compact). Now, we can choose an angle $\mu < \gamma$
such that for all vectors $v$ at angle at most $\mu$ from $w^*$, we have

\[
\mathcal{L}_\tau^D(v) \ge \min_{0 \le \lambda \le 1} \mathcal{L}_\tau^D(\lambda w^*) - \epsilon/2.
\]

Since hinge loss minimization will eventually (in expectation over large enough samples) output classifiers
of hinge loss strictly smaller than $\mathcal{L}_\tau^D(w_\tau) + \epsilon/2$, it will then not output classifiers of angle smaller than $\mu$ to
$w^*$. By Equation 2, for all $w$, $\mathrm{err}_D(w) - \mathrm{err}_D(w^*) \ge \beta\,\frac{\theta(w, w^*)}{\pi}$; therefore, the excess error of the classifier
returned by hinge loss minimization is lower bounded by a constant $\beta\frac{\mu}{\pi}$. Thus, hinge loss minimization is
not consistent with respect to the 0/1-loss. □


Proof of Theorem 3

We will show that, for every bound on the noise $\eta_0$ and for every $\tau_0 > 0$ there is an $\alpha_0 > 0$, such that
the unit length vector $w$ has strictly lower $\tau$-hinge loss than the unit length vector $w^*$ for all $\tau \ge \tau_0$. By
Lemma 7 this implies that for every bound on the noise $\eta_0$ and for every $\tau_0$ there is an $\alpha_0 > 0$ such that for
all $0 < \lambda \le 1$ we have $\mathcal{L}_{\tau_0}(\lambda w) < \mathcal{L}_{\tau_0}(\lambda w^*)$. This implies that the hinge minimizer is not a multiple of $w^*$
and so is at a positive angle to $w^*$. Now Lemma 8 tells us that hinge loss minimization is not consistent for
the 0/1-loss.

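In the sequel, we will now focus on the unit length vectors $w$ and $w^*$ and show
how to choose $\alpha_0$ as a function of $\tau_0$ and $\eta_0$. We let $cA$ denote the hinge loss of
$h_{w^*}$ on one wedge (one half of) area $A$ when the labels are correct and $dA$ that
hinge loss on that same area when the labels are not correct. Analogously, we define
$cB$, $dB$, $cD$ and $dD$. For example, for $\tau \ge 1$, we have (integrating the hinge loss
over the unit ball using polar coordinates)

\[
\begin{aligned}
cA &= \frac{1}{\pi}\int_0^1\!\!\int_0^{\alpha} \left(1 - \frac{z}{\tau}\sin(\varphi)\right) z\, d\varphi\, dz, &
dA &= \frac{1}{\pi}\int_0^1\!\!\int_0^{\alpha} \left(1 + \frac{z}{\tau}\sin(\varphi)\right) z\, d\varphi\, dz, \\
cB &= \frac{1}{\pi}\int_0^1\!\!\int_{\alpha}^{\frac{\pi+\alpha}{2}} \left(1 - \frac{z}{\tau}\sin(\varphi)\right) z\, d\varphi\, dz, &
dB &= \frac{1}{\pi}\int_0^1\!\!\int_{\alpha}^{\frac{\pi+\alpha}{2}} \left(1 + \frac{z}{\tau}\sin(\varphi)\right) z\, d\varphi\, dz, \\
cD &= \frac{1}{\pi}\int_0^1\!\!\int_0^{\frac{\pi-\alpha}{2}} \left(1 - \frac{z}{\tau}\sin(\varphi)\right) z\, d\varphi\, dz, &
\text{and } dD &= \frac{1}{\pi}\int_0^1\!\!\int_0^{\frac{\pi-\alpha}{2}} \left(1 + \frac{z}{\tau}\sin(\varphi)\right) z\, d\varphi\, dz.
\end{aligned}
\]

Figure 3: $D_{\alpha,\eta}$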
Now we can express the hinge loss of both $h_{w^*}$ and $h_w$ in terms of these quantities. For $h_{w^*}$ we have

\[
\mathcal{L}_\tau(h_{w^*}) = 2 \cdot \left(\eta(dA + dB) + (1 - \eta)(cA + cB) + cD\right).
\]

For $h_w$, note that area $B$ relates to $h_w$ as area $D$ relates to $h_{w^*}$ (and vice versa). Thus, the roles of $B$ and $D$
are exchanged for $h_w$. That is, for example, for the noisy version of area $B$ the classifier $h_w$ pays $dD$. We
have

\[
\mathcal{L}_\tau(h_w) = 2 \cdot \left(\eta(cA + dD) + (1 - \eta)(dA + cD) + cB\right).
\]

This yields

\[
\mathcal{L}_\tau(h_w) - \mathcal{L}_\tau(h_{w^*}) = 2 \cdot \left((1 - 2\eta)(dA - cA) - \eta\left((dB - cB) - (dD - cD)\right)\right).
\]

We now define area $C$ as the points at angle between $\pi - \alpha/2$ and $\pi + \alpha/2$ from
$w^*$ (see Figure 4). We let $cC$ and $dC$ be defined analogously to the above.

Note that $dA + dB - dD = dC$ and $cA + cB - cD = cC$. Thus we get

\[
\begin{aligned}
\mathcal{L}_\tau(h_w) - \mathcal{L}_\tau(h_{w^*})
&= 2 \cdot \left((1 - 2\eta)(dA - cA) - \eta\left((dB - cB) - (dD - cD)\right)\right) \\
&= 2 \cdot \left((1 - \eta)(dA - cA) - \eta\left((dB - cB) + (dA - cA) - (dD - cD)\right)\right) \\
&= 2 \cdot \left((1 - \eta)(dA - cA) - \eta(dC - cC)\right).
\end{aligned}
\]

Figure 4: Area C

If $\eta > \eta(\alpha, \tau) := \frac{(dA - cA)}{(dA - cA) + (dC - cC)}$, then we get $\mathcal{L}_\tau(h_w) - \mathcal{L}_\tau(h_{w^*}) < 0$ and
thus $h_w$ has smaller hinge loss than $h_{w^*}$. Thus, $\eta(\alpha, \tau)$ signifies the amount of noise from which onward
$w$ will have smaller hinge loss than $w^*$.

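Given $\tau_0 \ge 0$, choose $\alpha$ small enough (we can always choose the angle $\alpha$ sufficiently small for this) so
that the area $A$ is included in the $\tau_0$-band around $w^*$. We have for all $\tau \ge \tau_0$:

\[
\begin{aligned}
(dA - cA) &= \frac{2}{\pi}\int_0^1\!\!\int_0^{\alpha} \frac{z^2}{\tau}\sin(\varphi)\, d\varphi\, dz \\
&= \frac{2}{3\pi\tau}\int_0^{\alpha} \sin(\varphi)\, d\varphi \\
&= \frac{2}{3\pi\tau}\left[-\cos(\varphi)\right]_0^{\alpha} \\
&= \frac{2}{3\pi\tau}\left(1 - \cos(\alpha)\right).
\end{aligned}
\]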
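
The closed form for $dA - cA$ can be checked by integrating the two hinge losses numerically. The sketch below uses a midpoint rule on the wedge $\{0 \le z \le 1,\ 0 \le \varphi \le \alpha\}$ with density $1/\pi$; the function names are hypothetical. It keeps the clipping at zero in the hinge loss, so it agrees with the closed form only when $z \sin(\varphi) \le \tau$ on the whole wedge, which is what the requirement that $A$ lies in the $\tau_0$-band guarantees.

```python
import math
from typing import Tuple


def wedge_losses(alpha: float, tau: float, steps: int = 300) -> Tuple[float, float]:
    """Midpoint-rule values of (cA, dA) on the wedge 0 <= phi <= alpha, 0 <= z <= 1."""
    hz = 1.0 / steps
    hp = alpha / steps
    c_a = d_a = 0.0
    for i in range(steps):
        z = (i + 0.5) * hz
        for j in range(steps):
            phi = (j + 0.5) * hp
            margin = z * math.sin(phi) / tau
            c_a += max(0.0, 1.0 - margin) * z   # correctly labeled point
            d_a += max(0.0, 1.0 + margin) * z   # point with flipped label
    scale = hz * hp / math.pi                   # cell area times the density 1/pi
    return c_a * scale, d_a * scale


def closed_form_gap(alpha: float, tau: float) -> float:
    """Closed form 2 (1 - cos(alpha)) / (3 pi tau) for dA - cA."""
    return 2.0 * (1.0 - math.cos(alpha)) / (3.0 * math.pi * tau)
```

For instance, with $\alpha = 0.3$ and $\tau = 2$ the numerical difference `d_a - c_a` matches `closed_form_gap(0.3, 2.0)` to within the integration error.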


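For the area $C$ we now consider the case of $\tau \ge 1$ and $\tau < 1$ separately. For $\tau \ge 1$ we get

\[
\begin{aligned}
(dC - cC) &= \frac{4}{\pi}\int_0^1\!\!\int_{\frac{\pi-\alpha}{2}}^{\frac{\pi}{2}} \frac{z^2}{\tau}\sin(\varphi)\, d\varphi\, dz \\
&= \frac{4}{3\pi\tau}\int_{\frac{\pi-\alpha}{2}}^{\frac{\pi}{2}} \sin(\varphi)\, d\varphi \\
&= \frac{4}{3\pi\tau}\cos\left(\frac{\pi-\alpha}{2}\right) \\
&= \frac{4}{3\pi\tau}\sin\left(\frac{\alpha}{2}\right).
\end{aligned}
\]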
sm 1 — 
Thus, for $\tau \ge 1$ we get

\[
\eta(\alpha, \tau) = \frac{(dA - cA)}{(dA - cA) + (dC - cC)} = \frac{1 - \cos(\alpha)}{1 - \cos(\alpha) + 2\sin(\alpha/2)}.
\]

We call this quantity $\eta_1(\alpha)$ since, given that $\tau \ge 1$, it does not depend on $\tau$.

Observe that $\lim_{\alpha \to 0} \eta_1(\alpha) = 0$. This will yield the first condition on the angle $\alpha$: given some bound on
the allowed noise $\eta_0$, we can choose an $\alpha$ small enough so that $\eta_1(\alpha) \le \eta_0/2$. Then, for the distribution
$D_{\alpha,\eta_0}$ we have $\mathcal{L}_\tau(w) < \mathcal{L}_\tau(w^*)$ for all $\tau \ge 1$.
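
The choice of $\alpha$ for the case $\tau \ge 1$ can be illustrated in code. The sketch below evaluates $\eta_1(\alpha) = \frac{1 - \cos\alpha}{1 - \cos\alpha + 2\sin(\alpha/2)}$ and halves a starting angle until $\eta_1(\alpha) \le \eta_0/2$. The halving search and the function names are choices made here for illustration; any sufficiently small $\alpha$ works.

```python
import math


def eta1(alpha: float) -> float:
    """Noise threshold eta_1(alpha) for hinge parameter tau >= 1."""
    gap_a = 1.0 - math.cos(alpha)
    return gap_a / (gap_a + 2.0 * math.sin(alpha / 2.0))


def choose_alpha(eta0: float, start: float = 1.0) -> float:
    """Halve the angle until eta_1(alpha) <= eta0 / 2 (eta0 must be positive)."""
    if eta0 <= 0.0:
        raise ValueError("eta0 must be positive")
    alpha = start
    while eta1(alpha) > eta0 / 2.0:
        alpha /= 2.0
    return alpha
```

Using $1 - \cos\alpha = 2\sin^2(\alpha/2)$, the threshold simplifies to $\sin(\alpha/2) / (1 + \sin(\alpha/2))$, which makes it evident that $\eta_1(\alpha)$ is increasing on $(0, \pi)$ and tends to 0 with $\alpha$, so the loop terminates.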


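We now consider the case $\tau < 1$. For this case we lower bound $(dC - cC)$ as follows. We have

\[
\begin{aligned}
dC &= \frac{2}{\pi}\int_0^1\!\!\int_{\frac{\pi-\alpha}{2}}^{\frac{\pi}{2}} \left(z + \frac{z^2}{\tau}\sin(\varphi)\right) d\varphi\, dz \\
&= \frac{\alpha}{2\pi} + \frac{2}{\pi}\int_0^1\!\!\int_{\frac{\pi-\alpha}{2}}^{\frac{\pi}{2}} \frac{z^2}{\tau}\sin(\varphi)\, d\varphi\, dz \\
&= \frac{\alpha}{2\pi} + \frac{2}{3\tau\pi}\sin\left(\frac{\alpha}{2}\right).
\end{aligned}
\]

We now provide an upper bound on $cC$ by integrating over the triangular shape
$T$ (see Figure 5). Note that this bound on $cC$ is actually exact if $\tau < \cos(\alpha/2)$ and
only a strict upper bound for $\cos(\alpha/2) < \tau < 1$. We have

\[
\begin{aligned}
cC \le (cT) &= \frac{2}{\pi}\int_0^{\tau} \left(1 - \frac{z}{\tau}\right)\left(z\tan(\alpha/2)\right) dz \\
&= \frac{2}{\pi}\int_0^{\tau} \left(z\tan(\alpha/2) - \frac{z^2}{\tau}\tan(\alpha/2)\right) dz \\
&= \frac{\tau^2}{3\pi}\tan\left(\frac{\alpha}{2}\right).
\end{aligned}
\]

Figure 5: Area T

Thus we get

\[
(dC - cC) \ge (dC - (cT)) = \frac{1}{\pi}\left(\frac{\alpha}{2} + \frac{2}{3\tau}\sin\left(\frac{\alpha}{2}\right) - \frac{\tau^2}{3}\tan\left(\frac{\alpha}{2}\right)\right).
\]

This yields, for the case $\tau < 1$,

\[
\eta(\alpha, \tau) \le \frac{\frac{2}{3\tau}(1 - \cos(\alpha))}{\frac{2}{3\tau}(1 - \cos(\alpha)) + \frac{2}{3\tau}\sin\left(\frac{\alpha}{2}\right) + \frac{\alpha}{2} - \frac{\tau^2}{3}\tan\left(\frac{\alpha}{2}\right)}.
\]

We call this quantity $\eta_2(\alpha, \tau)$ to differentiate it from $\eta_1(\alpha)$. Again, it is easy to show that we have
$\lim_{\alpha \to 0} \eta_2(\alpha, \tau) = 0$ for every $\tau$. Thus, for a fixed $\tau_0$, we can choose an angle $\alpha$ small enough so that
$\mathcal{L}_{\tau_0}(w) < \mathcal{L}_{\tau_0}(w^*)$.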
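
The behavior of the threshold for $\tau < 1$ can be explored numerically. The sketch below evaluates the bound $\eta_2(\alpha, \tau)$ using the closed forms for $dA - cA$, $dC$ and $cT$ derived above (with the common factor $1/\pi$ cancelled); the function name is hypothetical and the expression is the reconstruction used in this section, so treat it as an illustration rather than a reference implementation.

```python
import math


def eta2(alpha: float, tau: float) -> float:
    """Upper bound eta_2(alpha, tau) on the noise threshold for 0 < tau < 1."""
    gap_a = (2.0 / (3.0 * tau)) * (1.0 - math.cos(alpha))
    gap_c_lower = (
        alpha / 2.0
        + (2.0 / (3.0 * tau)) * math.sin(alpha / 2.0)
        - (tau ** 2 / 3.0) * math.tan(alpha / 2.0)
    )
    return gap_a / (gap_a + gap_c_lower)
```

For a small angle such as $\alpha = 0.01$ the bound is already below $0.01$ at $\tau = 0.5$, and it decreases further as $\tau$ grows towards 1, in line with the monotonicity argument that follows.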
To argue that we will then also have $\mathcal{L}_\tau(w) < \mathcal{L}_\tau(w^*)$ for all $\tau \ge \tau_0$, we show that, for a fixed angle
$\alpha$, the function $\eta(\alpha, \tau)$ gets smaller as $\tau$ grows. For this, it suffices to show that $g(\tau) = \frac{\alpha}{2}\tau - \frac{\tau^3}{3}\tan\left(\frac{\alpha}{2}\right)$ is
monotonically increasing with $\tau$ for $\tau < 1$. Differentiating $g$ with respect to $\tau$ and using that $\tau^2 < 1$ and
$\frac{2\alpha}{\tan(\alpha/2)} > 1$ for $0 < \alpha < \pi/3$, we get that (for sufficiently small $\alpha$) $g'(\tau) > 0$
and thus $g(\tau)$ is monotonically increasing for $0 < \tau < 1$ as desired.

Summarizing, for a given $\tau_0$ and $\eta_0$, we can always choose $\alpha_0$ sufficiently small so that both $\eta_1(\alpha_0) < \frac{\eta_0}{2}$
and $\eta_2(\alpha_0, \tau) < \frac{\eta_0}{2}$ for all $\tau \ge \tau_0$ and thus $\mathcal{L}_\tau^{D_{\alpha_0,\eta_0}}(w) < \mathcal{L}_\tau^{D_{\alpha_0,\eta_0}}(w^*)$ for all $\tau \ge \tau_0$. This completes the
proof.